1 Introduction

The analysis and manipulation of matrices lie at the core of many mathematical and computational fields. Linear algebra, in particular, forms the foundation of numerous disciplines, including physics, engineering, and economics. Solving systems of linear equations, determining matrix inverses, and performing numerical computations are among the essential tasks in these domains. Matrix factorization techniques have emerged as powerful tools to simplify such tasks by decomposing a given matrix into more manageable and informative components. One such widely used matrix factorization method is the LU decomposition. In this paper, we focus on the Doolittle LU decomposition, which expresses a given matrix A as a product of a unit lower triangular matrix L and an upper triangular matrix U. This variant of LU decomposition is standard in most linear algebra packages and provides efficient solutions to systems of linear equations, determinants, and matrix inverses. Note that there are other LU decomposition variants, such as the Crout LU decomposition, which uses a unit upper triangular matrix U.

In recent years, we have witnessed the emergence of reduced-precision formats, such as the 16-bit floating-point format (FP16). These reduced-precision formats have gained popularity due to their ability to accelerate computations and reduce memory requirements, particularly in deep learning. However, for general high-performance computing (HPC) applications, the adoption of reduced precision poses new challenges for traditional algorithms such as the LU decomposition, where high accuracy is usually needed. Consequently, there have been efforts to explore and utilize the capabilities of mixed-precision calculations in computational tasks. One significant challenge has been the conversion of matrices from single or double precision to half precision, given the narrower range of the target format. Higham et al. [1] tackled this issue by proposing a method to achieve such conversions. Additionally, Haidar et al. [2] conducted an investigation demonstrating the potential benefits of low-precision floating-point arithmetic in HPC applications, specifically for solving a system of linear equations characterized by a large dense coefficient matrix (A), which requires double-precision accuracy. To achieve this accuracy, they developed a framework based on mixed-precision iterative refinement, which builds upon prior advances. Their method utilizes FP16 accuracy in the trailing update (TU) process and performs panel factorization (PF) in FP32. Dongarra et al. [3] explored the implications of leveraging different floating-point arithmetic precisions and optimizing communication costs in the context of the recent generation of supercomputers.

Furthermore, Higham et al. [4] introduced a technique for obtaining accurate and stable solutions to linear systems or least squares problems using a Cholesky factorization computed at low precision. Abdelfattah et al. [5] developed an FP16-accelerated dense linear solver for symmetric positive definite (SPD) systems. Their method involves integrating a mixed-precision Cholesky factorization with iterative refinement algorithms, resulting in double-precision accuracy.

Memory-bound algorithms have been a focus of investigation as well. Anzt et al. [6] showed that adjusting the length of exponents based on data range can significantly reduce memory footprint and resource consumption.

Huang et al. [7] provide a detailed analysis of Gaussian Elimination with Partial Pivoting, focusing on its numerical stability. Their results demonstrate that for a random \(n \times n\) Gaussian matrix \(A\), the growth factor is polynomially bounded, i.e., \(\rho (A) \le n^C\), where \(C > 0\) is an unspecified constant. This implies that solving a system \(Ax = b\) using Gaussian Elimination with Partial Pivoting requires \(m + O(\log n)\) bits of precision, improving upon previous estimates of \(m + O(\log ^2 n)\). Additionally, they establish that when the matrix dimension \(n\) is sufficiently large, the probability of successful execution of Gaussian Elimination with Partial Pivoting in floating-point arithmetic is at least \(1 - u^{1/8} n^{\tilde{C}}\), where \(u\) represents the unit roundoff and \(\tilde{C}\) is a constant. The exact value of \(\tilde{C}\) depends on the growth factor and remains an open problem.

The method of block elimination with additive modifications (BEAM) proposed by Lindquist et al. [8] has emerged as an alternative to traditional Gaussian Elimination with Partial Pivoting. BEAM enhances the LU factorization process by applying modifications to entire diagonal blocks rather than individual elements, which allows for greater exploitation of nonlocal numerical properties and improved arithmetic intensity. This approach not only reduces the performance overheads associated with pivoting but also retains the numerical stability typically achieved in classic LU implementations. The method’s effectiveness is further enhanced by parameter selection, particularly regarding the tolerance levels and the application of the Woodbury formula, which collectively influence the performance and accuracy of the factorization process.

An innovative approach to mixed-precision LU factorization, based on a left-looking scheme that does not involve pivoting, was proposed by López et al. [9]. The authors demonstrated that their methodology incurs less data movement and a smaller memory footprint, potentially leading to improved speed.

Similarly, our work exploits reduced FP16 [-FP32] precision in search of higher performance. However, contrary to other works, we aim to generate L and U factors that retain FP64 accuracy and general applicability thanks to the application of pivoting. Depending on the precision used for each method and the part of the matrix to which the techniques are applied, we have achieved different levels of accuracy. Notably, one of these methods attains accuracy identical to FP64, while the others approach FP64 accuracy to varying degrees. The implementation of these techniques was initially driven by the limitations of CPU hardware in supporting half-precision arithmetic. At the time, our CPUs lacked native support for half-precision computations, which necessitated the use of GPUs for effective implementation. However, with recent advancements in CPU technology, there is now improved support for half-precision computations.

1.1 Contributions

In this paper, we focus on the application of reduced-precision FP16[-FP32] arithmetic in the context of FP64 LU decomposition and assess the performance and accuracy of the resulting solutions. In more detail, the paper includes the following contributions:

  • Present a mixed-precision algorithm based on pre-pivoting the matrix (PRP) which 1) executes an LU in low precision; 2) uses the pivot list produced in the first stage to permute the matrix that stores data in high precision (FP64); and finally 3) performs an LU without pivoting (NPV) in high precision (FP64).

  • Introduce an alternative approach capable of producing an LU factorization in FP64, while internally employing Mixed-Precision Panel Factorization (MPF).

  • Create high-performance GPU-only (native) implementations of the LU factorization using half-precision floating-point arithmetic (FP16) or mixed FP16-FP32 precision which include auxiliary functions such as Half-precision TRiangular Solve Matrix (HTRSM).

  • Implement a high-performance native double-precision LU factorization without pivoting (NPV).

  • Evaluate the accuracy of all the implementations described above using several metrics. In this way, a potential user interested in their performance can decide if the level of accuracy that a particular implementation can offer is enough for their needs.

The paper is structured as follows: Sect. 2 provides a review of floating-point arithmetic accuracy and precision. Additionally, it discusses the LU factorization and the influence of pivoting on the LU decomposition. In Sect. 3, we introduce our novel algorithms designed to accelerate the LU factorization, Sect. 4 analyzes their time complexity, and Sect. 5 discusses the essential auxiliary functions required for implementing the proposed algorithms. Section 6 begins with a detailed description of the hardware setup utilized in our research, followed by an exploration of the metrics employed to assess the accuracy of our proposed methods. It concludes by evaluating the performance of both proposed algorithms and auxiliary functions in hybrid (CPU + GPU) and native (GPU) modes. Finally, in Sect. 7, we provide a comprehensive summary of our findings and comment on potential extensions of this work.

2 Background

This section highlights key themes: accuracy and precision in numerical computations; the foundational LU decomposition technique; and pivoting methods crucial for process stability and efficiency.

2.1 Accuracy and precision

The IEEE Standard for floating-point arithmetic [10] has evolved to accommodate a new format with only 16 bits [11], divided into a 1-bit sign indicator, a 5-bit biased exponent, and a 10-bit fraction. Google has also introduced the BFloat16 (brain floating point) format, a 16-bit truncation of the 32-bit IEEE 754 single-precision format. BFloat16 retains the approximate dynamic range of 32-bit floating-point numbers but offers less precision than FP32 and FP16, because it uses only 7 bits for the significand, in contrast to FP32’s 23 bits and FP16’s 10 bits. Converting FP64 to FP16 involves scaling down range and precision, utilizing rounding modes such as Round-to-Nearest-Even, Round-Down, and Round-Toward-Zero. Values exceeding the maximum representable FP16 value are clamped to prevent overflow, while extremely small numbers are set to the smallest subnormal FP16 value or zero.

In computation-intensive applications such as deep learning, utilizing half precision offers significant performance benefits. With a relative accuracy of approximately \(10^{-4}\), half precision is adequate for training neural networks. This allows efficient use of computational resources without substantial precision trade-offs.

In general, half precision is particularly advantageous for applications that only require a low dynamic range. However, many areas of science and engineering need higher precision, in single or double precision, or even extended-precision formats to minimize rounding errors. In software, support for arbitrary-precision (or multiple-precision) arithmetic is available through libraries such as GMP [12], MPFR [13], and FloatX [14]; its primary constraint is the amount of memory available on the host system. Such arbitrary precision is used where the speed of calculation is not a concern and more precise results are required. Hardware support for multiple-precision arithmetic is not yet widespread, but it can be found in both off-the-shelf systems [15] and ad hoc devices [16, 17].

In recent times, certain processors have integrated native support for half-precision data types.

Several modern programming languages now provide support for FP16 operations. In C++, the std::float16_t type was introduced in the C++23 standard, and in C, the _Float16 type is available through compiler support. CUDA provides the __half and __half2 types, along with intrinsic functions for FP16 arithmetic operations. Fortran has also included the type since the 2023 standard. Hardware support for FP16 varies across architectures. In general, CPU support lags behind that of GPUs, and most CPUs provide support for conversion and storage rather than arithmetic computation. Intel has introduced FP16 support in some recent CPU architectures, such as Sapphire Rapids, through AVX512-FP16 instructions. IBM POWER9 and later processors include native FP16 support for storage and conversion. On Arm-based systems, half precision is supported in the Armv8.2 and later Application profile architectures, and GCC supports half-precision floating point via the __fp16 type. Arm defines two incompatible representations for half precision: the IEEE 754-2008 format and the Arm alternative format, which extends the range of exponents but does not support infinities or NaNs. The __fp16 type is for storage only; for arithmetic and other operations, __fp16 values are automatically promoted to float. RISC-V, while newer to the scene, has also defined extensions for FP16 operations, including Zfh and Zfhmin, though hardware implementations may vary among vendors.

Notably, NVIDIA GPUs were among the early commercially available devices to offer extensive support for half-precision execution. Moreover, NVIDIA introduced the Tensor Core unit, a specialized hardware component focused on providing an extra performance boost for General Matrix Multiplications (GEMM) when operating in mixed-precision mode. This innovation has significantly enhanced the capabilities and efficiency of computational tasks on these devices.
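As an illustration of how such a mixed-precision GEMM is typically invoked, the sketch below calls cuBLAS's cublasGemmEx with FP16 inputs and FP32 accumulation. The wrapper function and its name are ours, the device buffers are assumed to be allocated and populated elsewhere, and error checking is omitted; this is not the paper's code.

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Multiply two FP16 matrices with FP32 accumulation, letting cuBLAS dispatch
// Tensor Core kernels. dA (m x k), dB (k x n) and dC (m x n) are assumed to be
// device buffers already filled with column-major __half data.
void tensor_core_hgemm(cublasHandle_t handle, int m, int n, int k,
                       const __half *dA, const __half *dB, __half *dC)
{
    const float alpha = 1.0f, beta = 0.0f;   // scalars match the FP32 compute type
    cublasGemmEx(handle,
                 CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 dA, CUDA_R_16F, m,          // A stored in FP16
                 dB, CUDA_R_16F, k,          // B stored in FP16
                 &beta,
                 dC, CUDA_R_16F, m,          // C stored in FP16
                 CUBLAS_COMPUTE_32F,         // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT);       // cuBLAS may select Tensor Core kernels
}
```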

2.2 LU factorization

According to Golub and Van Loan [18], “LU factorization is a high level algebraic description of Gaussian Elimination.” Its goal is to transform a given system of linear equations, Ax = b, into a simpler system which writes the coefficient matrix as the product of a lower triangular matrix (L) and an upper triangular matrix (U). By using them, the solution to a linear system of equations can be obtained for multiple right-hand sides. Thus, the procedure has two phases [19]. First, convert the given system into an upper triangular equivalent, a process commonly referred to as the Sweep-Out Phase (SWOP). Following this, the second phase involves recovering the solution vector in reverse order, known as the Back-Substitution Phase (BSP).

Several widely used libraries in the field of scientific computing such as LAPACK  [20], Eigen  [21], and MAGMA  [22] provide efficient implementations of the LU decomposition. The standard implementations of the LU factorization available in most libraries employ Partial Pivoting (PP) with row interchanges. They follow the convention of naming such routines as ?GETRF, where ? is replaced by a character that indicates working precision.

In this paper, our research centers on implementing LU decomposition within the GPU. For result comparison, we have chosen MAGMA as our baseline benchmark.

2.3 Pivoting strategy

In order to control errors during the SWOP and improve stability, numerical analysis has focused on studying various aspects, including the growth factor, backward error analysis, the condition number, scaling, and pivoting. Stabilization primarily relies on pivoting, which ensures that the elimination process encounters a nonzero (and sufficiently large) pivot at every step. This may imply swapping rows and/or columns to move a suitable pivot onto the diagonal. The process is represented algebraically by permutation matrices, which change the position of rows and columns but not their values. Although one can find some cases for which pivoting is not necessary [23], they are scarce. In addition, pivoting gives us the chance to make the algorithm backward stable. In each step of the elimination, the elements below the diagonal are divided by the pivot element on the diagonal; when this division is carried out in floating-point arithmetic, it makes sense to divide by a large element to reduce roundoff error. The main concern is the growth of elements, since dividing by a small quantity can produce large results. Various methods of pivoting are explored in [18], including Partial Pivoting, Complete Pivoting, Rook Pivoting, and Scaled Partial Pivoting. The simplest among these is Partial Pivoting (PP). While there may be theoretical concerns or extreme cases where PP can encounter issues, in practice it is a reliable and robust technique that provides satisfactory results for most numerical analysts.

Wilkinson’s papers, such as “Error Analysis of Direct Methods of Matrix Inversion” [24] and “Modern Error Analysis” [25], demonstrated that Gaussian Elimination with PP is normwise backward stable. Higham provides a comprehensive summary of this in his report “How Accurate is Gaussian Elimination?” [26]. This theoretical background confirms the robustness of PP in practical applications.

3 Proposed algorithms

In this section, we introduce innovative algorithms designed to optimize the LU decomposition with PP, placing a primary emphasis on minimizing data movement and enhancing overall computational speed. The proposed algorithms are denoted “Pre-Pivoted LU” and “LU with Mixed-Precision Panel Factorization.” The following sections provide detailed examinations of each algorithm.

3.1 Pre-pivoted LU (PRP) algorithm

After performing the LU factorization with partial pivoting (PP) on matrix A, the resulting L and U matrices, alongside the permutation matrix P, can be formulated as PA = LU. Matrix P maintains the sequence of row swaps that occur.

The concept of Pre-Pivoting (PRP) involves pre-calculating the permutation matrix P using reduced precision, which is then applied to matrix A before computing the LU factorization in double precision. By adopting this approach, the need for pivoting during LU factorization is reduced (see Fig. 1). However, it is important to note that the pre-calculated pivots in reduced precision may not always align with those obtained in double precision with PP, depending on factors such as matrix size and the type of reduced precision used. Our experiments, detailed in Sect. 6.3, show that these factors can lead to varying accuracies in the solution of linear equations.

In this context, the determination of the permutation matrix P, stored as a vector called IPIV, is carried out by performing the LU factorization either in pure FP16 half precision (via the hgetrf routine in native mode) or in mixed FP16-FP32 precision (via the xgetrf routine in native mode and the xshgetrf routine in hybrid mode). The final step involves swapping the rows of the original matrix in FP64 using the LAPACK routine laswp, culminating in an LU factorization without pivoting. Depending on the hardware configuration, this final step utilizes the npv routine in native mode or magma_dgetrf_nopiv in hybrid mode. The performance of each auxiliary routine mentioned here will be discussed in Sect. 6.8, detailing their behavior and relative speed.

Regarding the uniqueness of the LU decomposition, Higham’s book [27] shows that the LU factorization is not necessarily unique unless certain conditions are met. Specifically:

Theorem 1

(Nonuniqueness of LU Decomposition [27]) There exists a unique LU factorization of \(A \in \mathbb {R}^{n \times n}\) if and only if \(A_k\) is nonsingular for \(k = 1, \ldots , n-1\). \(A_k\) is the \(k \times k\) submatrix obtained by taking the first \(k\) rows and the first \(k\) columns of \(A\). If \(A_k\) is singular for some \(1 \le k \le n-1\), then the factorization may exist, but if so, it is not unique.

Therefore, while our proposed method using pre-pivoting does not yield the exact LU decomposition with PP in double precision, this nonuniqueness is inherent to LU factorization and is well-documented in the literature. Additionally, as discussed in Sect. 2.3, there are various methods for pivoting that can result in different LU decompositions.

Fig. 1: PRP steps illustrated through a representative 4×4 matrix example: LU in FP16[-FP32] + LASWP + LU without pivoting in FP64

Algorithm 1: Half-Precision (HPRP) Variant of PRP

Algorithm 2: Mixed-Precision (XPRP) Variant of PRP

Algorithms 1 and 2 show the two variants of PRP. The HPRP variant identifies the pivot list solely in FP16 and applies it to the original matrix, whereas the XPRP variant determines the pivot list using a combination of FP16 and FP32. \(A_h\) and \(A_s\) denote auxiliary storage holding a copy of the input matrix in half (FP16) or single (FP32) floating-point format, respectively.
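To make the flow of HPRP concrete, the following host-side sketch strings the three stages together. All routine names are hypothetical placeholders that mirror the nomenclature used in the text; they are not MAGMA or cuBLAS entry points, and the sketch omits error handling and memory management.

```cpp
#include <cuda_fp16.h>

// Hypothetical building blocks mirroring the steps described in the text;
// none of these names correspond to actual MAGMA or cuBLAS entry points.
void convert_fp64_to_fp16(int m, int n, const double *dA, int ldda,
                          __half *dAh, int lddah);              // clamp overflow, flush underflow
void hgetrf_pp(int m, int n, __half *dAh, int lddah, int *ipiv); // FP16 LU with partial pivoting
void dlaswp_all(int n, double *dA, int ldda, const int *ipiv);   // apply the whole pivot list at once
void dgetrf_npv(int m, int n, double *dA, int ldda);             // FP64 LU without pivoting

// Schematic HPRP driver: the pivot list is obtained in FP16 and then reused
// to pre-pivot and factorize the FP64 matrix without further pivoting.
void hprp(int n, double *dA, int ldda, __half *dAh, int lddah, int *ipiv)
{
    convert_fp64_to_fp16(n, n, dA, ldda, dAh, lddah); // A_h <- fl16(A)
    hgetrf_pp(n, n, dAh, lddah, ipiv);                // pivot list from the FP16 LU
    dlaswp_all(n, dA, ldda, ipiv);                    // pre-pivot the FP64 matrix
    dgetrf_npv(n, n, dA, ldda);                       // NPV: FP64 LU without pivoting
}
```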

To convert double-precision floating-point values to half precision (FP16) within CUDA, the conversion follows these steps:

  1. 1.

    Conversion to Single Precision: The double-precision value is first converted to single precision (float).

  2. 2.

    Conversion to Half Precision: The single-precision value is then converted to half precision using CUDA’s intrinsic __float2half_rn, which ensures rounding to the nearest even number.

  3. 3.

    Overflow Handling: Values exceeding the maximum representable FP16 value are clamped to this maximum to prevent overflow.

  4. 4.

    Underflow Handling: Values smaller than the smallest positive normal FP16 value are mapped to zero.

Some compiler versions also allow converting directly from FP64 to FP16.
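A minimal CUDA kernel implementing these steps could look as follows; the kernel name, launch geometry, and the flush-to-zero treatment of subnormals are illustrative assumptions rather than the paper's actual implementation.

```cpp
#include <cuda_fp16.h>

// Element-wise FP64 -> FP16 conversion of an m x n column-major matrix:
// cast to FP32, clamp to the FP16 maximum, flush values below the smallest
// positive normal to zero, and round to nearest even.
__global__ void dlag2h_kernel(int m, int n,
                              const double *A, int lda,
                              __half *Ah, int ldah)
{
    const float FP16_MAX        = 65504.0f;
    const float FP16_MIN_NORMAL = 6.103515625e-05f;    // 2^-14
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // row index
    int j = blockIdx.y * blockDim.y + threadIdx.y;      // column index
    if (i < m && j < n) {
        float v = (float)A[i + (size_t)j * lda];        // step 1: FP64 -> FP32
        if (v >  FP16_MAX) v =  FP16_MAX;               // step 3: clamp overflow
        if (v < -FP16_MAX) v = -FP16_MAX;
        if (fabsf(v) < FP16_MIN_NORMAL) v = 0.0f;       // step 4: flush underflow to zero
        Ah[i + (size_t)j * ldah] = __float2half_rn(v);  // step 2: round to nearest even
    }
}
```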

3.2 LU with mixed-precision panel factorization (MPF)

Next, we introduce the MPF algorithm for performing the LU factorization: the DGETRF implementation is modified so that the standard PF with PP in FP64 is replaced by the HPRP algorithm. In this algorithm, we employ the DGETF2_NATIVE_NPV and HGETF2 routines, which are both designed to operate on a panel, unlike the NPV and HGETRF routines that operate at a higher level and internally invoke panel factorization routines. We initially determine the pivot list for the current panel using FP16 (HGETF2). Subsequently, after applying the pivots once, we perform the PF a second time in FP64, this time without pivoting (DGETF2_NATIVE_NPV). The entire process followed in order to factorize a panel (P) is illustrated in Fig.  2. Details are explained in Sect. 5.2.

Fig. 2: MPF steps in PF performed via HPRP with a panel size of 4×1

The panel width is constrained to a maximum of 1024 columns, which derives from the maximum panel width in MAGMA. Opting for HPRP as a substitute for the standard PF implies that, for each panel undergoing factorization, the data are first converted from FP64 to FP16 and stored in an auxiliary buffer (P\('\)). This approach restricts the accumulation of roundoff errors to within each panel, rather than compounding them across the entire matrix width, a scenario encountered when employing HGETRF or PRP over the entire matrix. Consequently, this procedure should result in higher accuracy than HPRP and XPRP, which are applied to the complete matrix, and should come close to matching the precision achieved by DGETRF.

4 Time complexity analysis

In this section, we will explore the time complexity of the proposed algorithms. We start by examining the HPRP and XPRP algorithms.

4.1 HPRP and XPRP algorithms

Both the HPRP and XPRP algorithms include the following steps:

  1. 1.

    Perform an LU decomposition (with pivoting) on an \(N \times N\) reduced-precision matrix.

  2. 2.

    Execute a row swap for all rows.

  3. 3.

    Conduct a second LU decomposition (without pivoting) on the double-precision matrix.

The time complexity for each LU decomposition is:

$$\begin{aligned} O\left( N^3\right) \end{aligned}$$

The time complexity for the row swap is:

$$\begin{aligned} O(N^2) \end{aligned}$$

Therefore, the total time complexity, combining all steps, is:

$$\begin{aligned} T_{\text {PRP}}(n) \approx O\left( N^3\right) + O\left( N^3\right) + O\left( N^2\right) \approx O\left( N^3\right) \end{aligned}$$

4.2 MPF algorithm

The MPF algorithm combines panel factorization steps in half precision (with pivoting) and double precision (without pivoting). To determine the total time complexity, we consider the following components:

  • Panel factorization in half precision: \(T_{\text {half}}(n)\)

  • Matrix factorization in double precision: \(T_{\text {double}}(n)\)

Total time complexity:

$$\begin{aligned} T_{\text {MPF}}(n) = T_{\text {half}}(n) + T_{\text {double}}(n) \end{aligned}$$

For an \(M \times N\) matrix, divided into panels of width \(r\), the time complexity for the \(i\)-th panel is:

$$\begin{aligned} O\left( \frac{2}{3}(M - ir)r^2\right) \end{aligned}$$

Summing for all panels:

$$\begin{aligned} T_{\text {half}}(n) = \frac{2}{3} r^2 \sum _{i=0}^{\frac{N}{r} - 1} (M - ir) \end{aligned}$$

Evaluating the sum and simplifying:

$$\begin{aligned} T_{\text {half}}(n) = \frac{2}{3} r^2 \left( \frac{N}{r} M - r \cdot \frac{\frac{N}{r}\left( \frac{N}{r} - 1\right) }{2}\right) = \frac{1}{3} N r \left( 2M - N + r\right) \end{aligned}$$

Assuming \(r\) is much smaller than \(M\) and \(N\):

$$\begin{aligned} T_{\text {half}}(n) \approx O\left( MNr\right) \end{aligned}$$

For a square matrix, where \(M = N\), this becomes:

$$\begin{aligned} T_{\text {half}}(n) \approx O(N^2r) \end{aligned}$$

As mentioned earlier, the time complexity for LU decomposition is:

$$\begin{aligned} T_{\text {double}}(n) \approx O\left( N^3\right) \end{aligned}$$

Since the pre-computed pivots are applied at each step, the double-precision part can be regarded as an LU decomposition with pivoting, whose asymptotic time complexity equals that of the LU without pivoting (the row swaps contribute only lower-order terms).

For a square \(N \times N\) matrix:

$$\begin{aligned} T_{\text {MPF}}(n) \approx O\left( N^2 r\right) + O\left( N^3\right) \approx O\left( N^3\right) \end{aligned}$$

The overall time complexity of the MPF algorithm is:

$$\begin{aligned} T_{\text {MPF}}(n) \approx O\left( N^3\right) \end{aligned}$$

5 Implementation

In this section, we provide details on the implementation of PRP and MPF, covering key components including LU without pivoting, computational setup (GPU-only or hybrid with CPU), resource requirements, and the implementation of look ahead (LA).

5.1 PRP

As indicated in Sect. 3, PRP requires auxiliary storage to hold a copy of the input matrix in half (FP16) or single (FP32) floating-point format, depending on the variant (\(A_h\) and \(A_s\) in Algorithms 1 and 2). Once the matrix is stored, the PRP algorithm proceeds with three main components: finding the pivot list in low precision, applying all row swaps at once, and performing the LU factorization without pivoting in FP64. For the second stage, we use the magma_dlaswp_columnserial routine introduced by MAGMA. During the LU decomposition process, the column-major layout offers an advantage for finding the pivot element, while the row-major layout aids significantly in applying the swaps. Addressing this challenge, the MAGMA team introduced a design to enhance the pivoting mechanism [28]. This approach maintains the matrix in transposed order, enabling column swapping for pivoting. By transposing the panel back in each iteration, the search for the pivot element follows a column-major pattern. This advancement enhances computational efficiency and memory access patterns throughout the factorization process. Therefore, in this section, we will look at the implementation specifics of the remaining two components: finding the pivot list and executing the LU factorization without pivoting.

5.1.1 LU decomposition with PP in low precision

Given the GPU’s efficiency in executing GEMM, offloading the TU (Trailing Update) and TRSM (TRiangular Solve Matrix) tasks to it proves beneficial. However, the remaining PF operation, being largely sequential, becomes a bottleneck in the LU decomposition process. To mitigate its computational cost, several approaches are considered, including overlapping it with other operations through CPU–GPU cooperation, or employing reduced precision. While FP32 is the only reduced precision available on the host side, the GPU also offers full support for FP16. The potential combination of these methods is discussed below.

Hybrid

MAGMA’s design includes a hybrid mode that handles PF on the CPU but offloads subsequent operations to the GPU. In the absence of FP16 arithmetic support on our CPU, this method uses FP32 precision during PF. (The ISA of the newer 4th-generation Intel Xeon Scalable processors does support a wide range of general-purpose FP16 numeric operations.) Given our hardware configuration, we use the XPRP variant of PRP, as described in Algorithm 2, in hybrid mode. The PF is computed on the CPU using the LAPACK routine SGETRF, while the STRSM (Single-precision TRiangular Solve Matrix) and HGEMM (Half-precision GEneral Matrix Multiplication) are computed on the GPU.

Native

The native mode performs all computations exclusively on the GPU after transferring the input matrix to device memory. We implemented two variants in native mode: a fully FP16 version (HGETRF) and a mixed FP16-FP32 version (XGETRF). For HGETRF, we developed FP16 implementations of PF (HGETF2) and HTRSM, while utilizing cuBLAS for the TU. To optimize performance, we pre-allocate a buffer at the start of HGETRF for use in HTRSM. For XGETRF, we used MAGMA’s FP32 PF and STRSM for GPU-only execution, combining them with cuBLAS’s FP16 TU. Both implementations incorporate a look-ahead (LA) technique using two CUDA streams, enabling concurrent execution of the PF alongside the Level 3 BLAS operations within the TU from the preceding iteration [29]. The rationale for implementing both variants is to offer options balancing performance and numerical stability: HGETRF provides maximum speed, while XGETRF offers a compromise between FP16 speed and FP32 accuracy, which is particularly beneficial for matrices requiring higher precision in PF. We developed HGETRF from scratch because no high-performance native fully-FP16 implementation was available in MAGMA, while for XGETRF we leveraged and adapted existing MAGMA components.

5.1.2 A double-precision LU without pivoting (NPV)

In order to provide a fast PRP algorithm, a high-performance implementation of the FP64 LU without pivoting (NPV) is essential. MAGMA offers a specialized routine called DGETRF_NOPIV, designed for LU factorization without pivoting in hybrid mode. This approach is particularly sensitive to hardware configurations, often leading to GPU idle times as it waits for data from the CPU. Our starting point for implementing the native version was the hybrid version of DGETRF in MAGMA, from which we removed all parts related to the search for a pivot, performing row swaps, and managing the pivot list.

Note that the process of performing PF is mostly sequential. The NPV, as used here, does not involve row swaps, allowing us to divide each PF into two distinct parts. First, we operate on a square block located in the upper part of the current panel, representing the square diagonal block in the context of the whole matrix to which this panel belongs. We apply the NPV algorithm to factorize it into L and U. Next, we focus on the remaining rows below the diagonal block that need to be updated. By calling DTRSM to update them, we can take advantage of cache reuse and parallel execution in BLAS-3.
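The split just described maps naturally onto a small driver: a (hypothetical) kernel factorizes the diagonal block in place, and a right-sided cublasDtrsm then overwrites the rows below it with the corresponding block of L. The sketch below assumes a column-major panel and is not MAGMA's or our actual routine.

```cpp
#include <cublas_v2.h>

// Hypothetical driver for the unpivoted factorization of the nb x nb
// diagonal block (the kernel itself corresponds to Algorithms 3 and 4 below).
void npv_diag(int nb, double *dP, int lddp);

// Factorize one m x nb panel without pivoting: (a) factorize the diagonal
// block in place, then (b) update the rows below it with a right-sided DTRSM,
// solving X * U11 = A21 so that A21 is overwritten with L21.
void npv_panel(cublasHandle_t handle, int m, int nb, double *dP, int lddp)
{
    npv_diag(nb, dP, lddp);                    // (a) L11 and U11 overwrite the diagonal block

    const double one = 1.0;
    cublasDtrsm(handle,
                CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_UPPER,
                CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
                m - nb, nb, &one,
                dP, lddp,                      // U11 (upper triangle of the diagonal block)
                dP + nb, lddp);                // (b) A21 -> L21
}
```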

Let us elaborate upon the implementation of the PF of the diagonal block. On the one hand, we implemented both recursive and blocked versions. In numerical linear algebra, a blocked algorithm organizes a computation so that it works on contiguous chunks of data, while a recursive algorithm reduces the problem size recursively until it fits into progressively smaller cache levels; see Section 3.2.11 of [18] for more details on both approaches. In the context of GPUs, however, recursive algorithms can lead to high register usage, potentially causing register spilling to slower local memory. Additionally, recursion requires managing a call stack in local memory, adding overhead and increasing memory access times. Unlike blocked algorithms, recursion often cannot be effectively inlined, resulting in additional function call overhead and performance penalties. That said, since we follow the MAGMA hybrid approach of launching CUDA kernels from the CPU, the recursion is implemented in the CPU code. The likely reasons for the better performance of the blocked code are its more efficient memory access patterns, reduced need for synchronization, and simpler execution flow, which minimizes overhead and maximizes resource utilization.

On the other hand, we explored two possibilities for optimizing the CUDA kernel operating on micro-panels: using shared memory, for which 64 turned out to be the best micro-panel width; or exploiting intra-warp thread-level synchronization through warp shuffle instructions, setting the micro-panel width to 32 in alignment with the warp size in CUDA and aiming to minimize intra-warp data movement and synchronization. The impact of each approach will be shown in Sect. 6.8.2.

Algorithms 3 and 4 present pseudocode for both implementations. It is worth mentioning that both versions are inspired by the MAGMA codes.

Algorithm 3: Shared Memory Approach

Algorithm 4: Warp Shuffling Approach
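To give a flavor of the warp-shuffle approach of Algorithm 4, the following simplified kernel factorizes a single 32×32 diagonal block without pivoting: one warp holds one row per lane in registers, and at each step the pivot row is broadcast via __shfl_sync, so no shared memory or block-level synchronization is needed. It is a hedged sketch, not the paper's actual kernel, and it omits the rows below the micro-panel.

```cpp
// Unpivoted LU of a 32x32 diagonal block D (column-major, leading dimension
// ldd), performed by a single warp. L overwrites the strictly lower triangle
// (unit diagonal implicit), U the upper triangle.
__global__ void npv_diag32_shfl(double *D, int ldd)
{
    const unsigned FULL_MASK = 0xffffffffu;
    int lane = threadIdx.x;                    // launched with one warp: <<<1, 32>>>
    double row[32];

    #pragma unroll
    for (int j = 0; j < 32; ++j) row[j] = D[lane + j * ldd];   // load row `lane`

    #pragma unroll
    for (int k = 0; k < 32; ++k) {
        // Broadcast the pivot U(k,k) from lane k to the whole warp.
        double ukk = __shfl_sync(FULL_MASK, row[k], k);
        double lmult = 0.0;
        if (lane > k) {
            lmult  = row[k] / ukk;             // multiplier L(lane, k)
            row[k] = lmult;                    // store L below the diagonal
        }
        // Rank-1 update of the trailing columns; U(k,j) is broadcast from lane k.
        #pragma unroll
        for (int j = k + 1; j < 32; ++j) {
            double ukj = __shfl_sync(FULL_MASK, row[j], k);
            if (lane > k) row[j] -= lmult * ukj;
        }
    }

    #pragma unroll
    for (int j = 0; j < 32; ++j) D[lane + j * ldd] = row[j];   // write L and U back
}
```

Such a kernel would be launched with a single warp, e.g., npv_diag32_shfl<<<1, 32>>>(dBlock, ldd), and a full NPV panel factorization would then update the rows below the micro-panel with DTRSM, as described in the previous subsection.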

5.2 MPF

MPF replaces the standard PF with HPRP, introducing additional resource requirements but also opening up new possibilities for parallelism. This section delves into the fundamental principles of MPF, examining both its resource demands and the strategies utilized to leverage parallel opportunities.

Resource requirements

Two buffers are needed by HPRP: one for the panel undergoing factorization in FP16 and another for performing the NPV PF in FP64. The second buffer is used because both NPV and DTRSM show higher performance when working on a panel stored column-wise than when working in place on the matrix stored row-wise. These buffers are allocated once, initially, and reused for each panel factorization, i.e., for each call to HPRP, throughout the whole LU factorization. The buffer size dedicated to processing each panel equals the maximum panel size, which is \(N \times r\), where \(N\) represents the matrix size and \(r\) denotes the panel width. Similarly, the buffer size for the HTRSM operation is also \(N \times r\).

Enhancing parallel execution with look ahead

We have three versions of the MPF. One does not perform LA when executing the PF; we refer to this variant simply as MPF. The other two implement LA in different ways. The matrix partition layout for the LA-enabled MPF and its associated routines is illustrated in Fig. 3, where the FP16 panel is depicted as the uppermost layer, distinguished by its blue hue.

Fig. 3: Partitioned layout of the matrix for LA-enabled versions of MPF

The \(A^{H}_{0,0}\), \(A^{H}_{1,0}\) and \(A^{H}_{2,0}\) blocks contain a copy of \(A_{0,0}\), \(A_{1,0}\) and \(A_{2,0}\) in FP16. These three reduced-precision blocks are collectively treated as a single panel and passed to HGETF2 to provide the pivot list. It will be shown in Sect. 6 that there is limited overlap between the execution of the PF of the next iteration and the TU of the current iteration when executing natively on a GPU. Thus, we leverage the split of the PF without pivoting via HPRP into two parts: the first updates the diagonal block \(A_{0,0}\) via DGETF2_NATIVE_NPV; the second updates the blocks \(A_{1,0}\) and \(A_{2,0}\) via DTRSM. The first variant, named MPF + LA, tries to overlap the PF of the next iteration in FP16 and the update of the diagonal block \(A_{0,0}\), via HPRP, with the TU of the current iteration. This variant of MPF operates with two distinct streams. Stream 1 is dedicated to HGETF2 and also manages the execution of DGETF2_NATIVE_NPV, LASWP, the DTRSM applied to \(A_{0,1}\), as well as the TU of \(A_{1,1}\) and \(A_{2,1}\). Meanwhile, Stream 2 is exclusively dedicated to the DTRSM operations that act on \(A_{1,0}\), \(A_{2,0}\) and \(A_{0,2}\). Refer to Fig. 4 for a visual representation of these details.

Fig. 4: CUDA streams layout in MPF + LA

We also implement a variant named MPF + Custom LA, in which the DTRSM applied to the blocks \(A_{1,0}\) and \(A_{2,0}\) of the next panel is not included in the LA, i.e., the LA is limited to computing the PF in FP16, followed by the LASWP and the DGETF2_NATIVE_NPV of the block \(A_{0,0}\). In this way, the DTRSM operations dedicated to blocks \(A_{0,1}\) and \(A_{0,2}\) of the next iteration, which depend on the block \(A_{0,0}\) but not on blocks \(A_{1,0}\) and \(A_{2,0}\), can run in parallel with the DTRSM dedicated to blocks \(A_{1,0}\) and \(A_{2,0}\) of the current iteration. In other words, the DTRSM that updates the lower part of the panel for the next iteration is scheduled on its own, rather than being overlapped with the TU of the current iteration.

The custom version of the algorithm uses three streams. It follows the same layout as shown in Fig. 4, with an additional stream dedicated to the DTRSM of blocks \(A_{1,0}\) and \(A_{2,0}\). In that figure, each swap shown is applied to the entire matrix.
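As a minimal, self-contained sketch of this stream choreography (not the paper's code), the following program orders stub kernels on two CUDA streams and uses an event so that work depending only on the factorized diagonal block can start without waiting for the whole of stream 1; the custom three-stream variant would add a third stream analogously. The stub kernels merely stand in for HGETF2, LASWP, DGETF2_NATIVE_NPV, and DTRSM.

```cpp
#include <cuda_runtime.h>

// Stub kernels standing in for the routines named in the text; they only mark
// where the real launches would go.
__global__ void hgetf2_stub()     {}
__global__ void laswp_stub()      {}
__global__ void dgetf2_npv_stub() {}
__global__ void dtrsm_stub()      {}

int main()
{
    const int npanels = 8;                      // illustrative panel count
    cudaStream_t s1, s2;
    cudaEvent_t  diag_ready;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEventCreateWithFlags(&diag_ready, cudaEventDisableTiming);

    for (int k = 0; k < npanels; ++k) {
        // Stream 1 (look-ahead path): FP16 panel factorization, row swaps and
        // the unpivoted FP64 factorization of the diagonal block A00.
        hgetf2_stub<<<1, 32, 0, s1>>>();
        laswp_stub<<<1, 32, 0, s1>>>();
        dgetf2_npv_stub<<<1, 32, 0, s1>>>();
        cudaEventRecord(diag_ready, s1);

        // Stream 2: DTRSM updates that depend only on the factorized diagonal
        // block; they wait on the event rather than on all of stream 1, so
        // they can overlap with the trailing update issued later on s1.
        cudaStreamWaitEvent(s2, diag_ready, 0);
        dtrsm_stub<<<1, 32, 0, s2>>>();

        // The trailing update of the current iteration would follow on s1.
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaEventDestroy(diag_ready);
    return 0;
}
```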

6 Evaluation

In this section, we perform a detailed evaluation of the above-mentioned algorithms. This evaluation aims to provide an understanding of their accuracy and performance. A brief overview of the proposed algorithms is given below:

  1. 1.

    HPRP

    • Pre-calculate pivot in FP16 and apply it to the double version of the matrix, then perform NPV over it.

    • Precision Used: FP16/FP64

  2. 2.

    XPRP

    • Pre-calculate pivot in mixed FP16-FP32 and apply it to the double version of the matrix, then perform NPV over it.

    • Precision Used: FP16/FP32/FP64

  3. 3.

    MPF

    • Pre-calculate pivot in FP16 for each panel and apply it to the double version of the matrix, then perform NPV over the matrix.

    • Precision Used: FP16/FP64

We begin by describing the experimental setup, followed by a discussion on the metrics used to evaluate the methods presented in this work. We continue by analyzing the accuracy of each algorithm. Finally, we explore the efficiency of these algorithms as well as the auxiliary routines that play a crucial role in implementing each algorithm.

6.1 Experimental setup

On the hardware side, we have carried out our research using three different NVIDIA GPUs with two different architectures. On the one hand, the V100s [30] is representative of the Volta GV100 GPU architecture. On the other hand, the A100 [31] and the RTX3090 [32] belong to the Ampere GPU architecture. The A100 and RTX3090 GPUs are attached to a system using an Intel Xeon Silver 4210R processor equipped with 128 GB of RAM. The V100s is plugged into a host with an Intel Xeon Gold 6126 CPU and 64GB of RAM. The specifications of each GPU, including the NVIDIA V100S, A100, and RTX3090, are shown in Table 1, where CC indicates CUDA cores and TC indicates Tensor cores.

Table 1 Hardware Configuration

For the software, we utilized version 2.5.4 of MAGMA as our baseline code for developing experiments and testing performance. The compilers used on the machine with the A100 and the RTX3090 were “gcc 9.3.0” and “nvcc V11.3.58,” respectively. Additionally, we used “cuBLAS 11.4.2.10064” for these GPUs. For the machine equipped with the V100s, we used “gcc 9.4.0” and “nvcc V11.4.48” as the compilers, along with “cuBLAS 11.5.2.43”. Our research has heavily relied on NVIDIA Nsight Systems (nsys) [33] to identify performance bottlenecks and optimize the execution time of our GPU codes. We have also inspected some kernels with NVIDIA Nsight Compute (ncu) [34].

In our experiments, we have used matrices initialized with pseudo-random numbers following a normal distribution; in the text, we refer to them as randn. For the sake of completeness, Fig. 5 provides their condition numbers, since the condition number indicates how well or ill conditioned each matrix is with respect to the linear-algebra problem being solved.

Fig. 5: Condition number of the test matrices used in our experimental setup

6.2 Metrics

We evaluate the performance of our proposed method using two main metrics: TFLOPS (Tera Floating-point Operations Per Second) rate and speedup. These metrics allow us to assess the improvement achieved by our approach over the baseline, which is DGETRF within MAGMA.

To evaluate the accuracy of our variants of the LU decomposition and the linear system of equations being solved, we present a set of metrics that we will use in this work. These metrics involve using vector and matrix norms.

The first metric, shown in Eq. (1), measures the relative difference between the product of the triangular factors (L and U) resulting from the LU decomposition being computed and the matrix A after row permutation using the permutation matrix (P).

$$\begin{aligned} \frac{\Vert PA - LU\Vert _F}{n \Vert A\Vert _F} \end{aligned}$$
(1)

Another useful metric is the relative norm of the residual for the computed solution (\(\hat{x}\)):

$$\begin{aligned} \frac{\Vert b - Ax\Vert _F}{n \Vert A\Vert _F \Vert x\Vert _F} \end{aligned}$$
(2)

However, small residuals do not necessarily imply high accuracy [18]. Thus, we include a third metric to assess the recovery of a true solution for a system of linear equations when the solution is known. We can do so by defining the true solution x, for instance, to be a set of randomly generated numbers (“x=randn(n)”, where n is the dimension of the problem) or a vector of ones (“x=ones(n)”). In either case, one can then use A and x to generate the right-hand side b (\(b=A \cdot x\)). We can then solve the system and find the computed \(\hat{x}\) (\(\hat{x}=A \backslash b\)) using some numerical method. In this paper, a vector of ones is used as the solution vector. Then, we use Eq. (3), which is usually referred to as the relative forward error, as discussed by Higham [27] (Section 1.2) and Golub and Van Loan [18] (Section 3.5), to check the accuracy of the solution found (\(\hat{x}\)) against the true solution (x):

$$\begin{aligned} \frac{\Vert \hat{x}-x\Vert _{\infty }}{\Vert x\Vert _{\infty }} \end{aligned}$$
(3)
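For completeness, Eq. (3) can be evaluated with a few lines of host code once the computed and true solutions are available; the helper below is a minimal sketch and not part of our benchmarking harness.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Relative forward error of Eq. (3) in the infinity norm; xhat is the solution
// computed by one of the LU-based solvers and x the known true solution
// (a vector of ones in our experiments).
double relative_forward_error(const std::vector<double> &xhat,
                              const std::vector<double> &x)
{
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        num = std::max(num, std::fabs(xhat[i] - x[i]));  // ||xhat - x||_inf
        den = std::max(den, std::fabs(x[i]));            // ||x||_inf
    }
    return num / den;
}
```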

6.3 Accuracy evaluation of HPRP, XPRP, and MPF

In this section, we analyze the accuracy of the LU decomposition obtained by the algorithms proposed in Sect. 3. Each method can potentially provide a different pivot list, which yields a different LU decomposition, something compatible with the nonuniqueness of the LU factorization mentioned earlier in Theorem 1. Based on Heuristic II on page 138 of [18], if the unit roundoff and the condition number of the problem satisfy \(u \approx 10^{-d}\) and \(\kappa \approx 10^{q}\), respectively, then Gaussian Elimination will yield a solution x with approximately \(d - q\) accurate decimal digits. Consequently, HGETRF will only be accurate for problems with very limited dimensions or low condition numbers.

Limiting reduced-precision computation to specific segments of the data helps mitigate error propagation. The intermediate computation stage in double precision in MPF acts as a barrier against error propagation: for each subsequent panel, it effectively resets the potential accumulation of reduced-precision errors. In contrast, the HPRP and XPRP methods perform reduced-precision computations over the entire matrix without this intermediate step. As a result, errors introduced during the panel factorization are propagated into the subsequent update phase and affect the pivot vector provided afterward. This difference highlights how an intermediate double-precision stage enhances stability and mitigates the cumulative effects of reduced-precision computation in the MPF algorithm.

Considering Eqs. (1) and (2), in Figs. 6 and 7, we observe that both HPRP and XPRP experience a significant accuracy reduction compared to DGETRF, which implements a full FP64 LU factorization. This decrease becomes more noticeable as the problem size increases.

Fig. 6: Comparing \(\Vert PA-LU\Vert _F / n\Vert A\Vert _F\) for DGETRF, XPRP, HPRP, and MPF using a logarithmic scale

Fig. 7: Comparing \(\Vert b-Ax\Vert _F / n\Vert A\Vert _F\Vert x\Vert _F\) for DGETRF, XPRP, HPRP, and MPF using a logarithmic scale

Generally speaking, the accuracy of HPRP aligns with that of DGETRF up to dimension 4k, and the similarity holds for XPRP up to dimension 24k. For MPF, we use HPRP only as a replacement for the standard PF; the accuracy obtained is of the same order of magnitude as that of DGETRF.

Based on Eq. (3), Fig. 8 allows us to get a clear idea of whether the solution found is close or not to the true solution when operating on randn matrices of different dimensions. By employing this metric, we can observe that the accuracy achieved by MPF is equivalent to that of DGETRF, which operates entirely in FP64. Therefore, investigating the performance of this algorithm is worthwhile.

Fig. 8: Comparison of \(\Vert \hat{x}-x\Vert _{\infty } / \Vert x\Vert _{\infty }\) for all our routines implementing the LU factorization in different precisions, using a logarithmic scale

6.4 Error analysis and limitations

The growth factor, denoted by \(\rho\), measures how much larger the elements of the matrix can become during LU factorization. For a matrix of dimension \(n\), the growth factor observed in practice with partial pivoting is typically bounded by \(\rho < n^{2/3}\), even though the theoretical worst-case bound is much larger. When performing LU factorization using FP16 arithmetic, where the maximum representable value is 65504, it is crucial to ensure that the growth factor does not cause any matrix element to exceed this limit.

To determine the maximum matrix dimension \(n\) that FP16 can support, consider a panel \(P\), a submatrix of width \(nb\) extracted from the original matrix \(A\). The pivot list is pre-calculated from the LU decomposition of \(P\), performed in FP16. The relationship between the maximum value in the panel after LU decomposition, \(\max _{i,j} |P_{ij}'|\), and the initial maximum value, \(\max _{i,j} |P_{ij}|\), is given by:

$$\begin{aligned} \max _{i,j} |P_{ij}'| = \max _{i,j} |P_{ij}| \times \rho \end{aligned}$$
(4)

Given the bound \(\rho < n^{2/3}\) and assuming the panel \(P\) is scaled such that \(\max _{i,j} |P_{ij}| = 1\), the maximum value after decomposition can be set to the FP16 maximum:

$$\begin{aligned} \max _{i,j} |P_{ij}'| = 65504 \end{aligned}$$

Thus, we have:

$$\begin{aligned} 1 \times n^{2/3} = 65504 \implies n = 65504^{3/2} \approx 16,764,929 \end{aligned}$$

This calculation suggests that, in a highly idealized scenario where the panel \(P\) is perfectly scaled, FP16 could support matrix dimensions up to approximately \(16,764,929\). However, this theoretical limit is rarely achievable in practice, as discussed below.

6.4.1 Practical considerations for panels in MPF

Several factors significantly constrain the practical application of LU factorization in FP16:

  1. 1.

    Panel Scaling (\(\max _{i,j} |P_{ij}| > 1\)): In practice, the maximum initial value \(\max _{i,j} |P_{ij}|\) in the panel \(P\) is often greater than 1, reducing the allowable dimension \(n\) to prevent overflow during factorization.

  2. 2.

    PRP Panel Growth Factor: In the PRP algorithm, the maximum panel width is 1024. We consider the growth factor \(\rho\) for LU decomposition:

    • For \(n = 1024\), the growth factor is approximately \(\rho = 1024^{2/3} \approx 101.594\).

    • To ensure that \(\max _{i,j} |P_{ij}'|\) does not exceed 65504, we solve:

      $$\begin{aligned} \max _{i,j} |P_{ij}| \times \rho \le 65504 \end{aligned}$$
    • Substituting \(\rho = 101.594\):

      $$\begin{aligned} \max _{i,j} |P_{ij}| \le \frac{65504}{101.594} \approx 644.765 \end{aligned}$$

    Therefore, the maximum possible value for \(\max _{i,j} |P_{ij}|\) before LU decomposition, to ensure that the maximum value after decomposition remains within the representable range of FP16, is approximately 644.765.

  3. 3.

    Practical Growth Factor Considerations: Given that the growth factor for partial pivoting in practice rarely exceeds 8 [24, p. 169], and assuming we are working with FP16 precision, the maximum safe initial value for elements in the matrix \(A\) to pre-calculate the pivot should be less than:

    $$\begin{aligned} \frac{65504}{8} = 8188. \end{aligned}$$

    This ensures that the computations remain within the representable range of FP16, thus avoiding numerical overflow.

  4. 4.

    Stability Bound for FP16 LU Factorization: A relevant theorem suggests that, for any pivoting strategy, LU factorization remains stable as long as \(n \times u < 1\), where \(u\) is the unit roundoff. For FP16 arithmetic, with \(u = 2^{-11} \approx 4.9 \times 10^{-4}\), the maximum matrix dimension \(n\) for stability is 2048, which is twice the maximum panel width typically used in the MPF algorithm.

Figure 9 shows the maximum absolute value of each panel for each method (MPF and DGETRF) across the same matrix dimensions. Except for the last panel, the maximum value of each panel for MPF and DGETRF is of the same order of magnitude and does not exceed the range supported by MPF. We should mention that, according to our observations, a large growth factor alone does not imply an unstable algorithm, as the backward and forward error results reported for MPF confirm.

Fig. 9: Maximum absolute value of panel elements in MPF and DGETRF for matrix sizes ranging from 1k to 40k

6.4.2 Practical considerations for determinant

According to Wilkinson [35], there is a widely held belief that the selection of large pivots is, in some respects, harmful. The argument is as follows: “The product of the pivots is equal to the determinant of \(A_0\) and is therefore independent of pivoting. If we select large pivots initially, then inevitably the later pivots must be smaller and therefore have higher relative errors.” Wilkinson also notes that the determinant can be computed as the product of the pivot elements obtained during LU factorization, even when partial pivoting is applied. This connection allows us to evaluate the quality of an algorithm’s pivot selection by comparing the determinants computed via different methods.

The remainder of this section analyzes the determinants of various test matrices, using the four methods HPRP, XPRP, MPF, and DGETRF. An easy way to compute the determinant is as the product of the diagonal (pivot) elements after LU decomposition, which in floating-point arithmetic can yield very large or very small values, leading to numerical instability. To address this, the logarithm of the determinant is used in this paper, calculated as:

$$\begin{aligned} \text {log (det)} = \sum _{i=1}^{n} \log \left( \left| M_{ii} \right| \right) \end{aligned}$$
(5)

where \(M_{ii}\) are the diagonal elements of the matrix after applying each method. The relative error, a crucial metric in comparing the accuracy of these methods, is computed as:

$$\begin{aligned} \text {Relative Error} = \frac{|\text {log(det)}_{\text {method}} - \text {log(det)}_{\text {DGETRF}}|}{|\text {log(det)}_{\text {DGETRF}}|} \end{aligned}$$
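Equation (5) is straightforward to evaluate once a factorization is available. The host-side helper below, a minimal sketch rather than our actual test code, sums \(\log |U_{ii}|\) over the diagonal of the computed factor; the sign of the determinant is ignored because only absolute values enter Eq. (5).

```cpp
#include <cmath>
#include <cstddef>

// Sum of log|U_ii| over the diagonal of the computed factor (column-major U
// with leading dimension ldu). Row swaps only affect the sign of the
// determinant, which is irrelevant here because Eq. (5) uses absolute values.
double log_abs_det_from_LU(int n, const double *U, int ldu)
{
    double logdet = 0.0;
    for (int i = 0; i < n; ++i)
        logdet += std::log(std::fabs(U[i + (std::size_t)i * ldu]));
    return logdet;
}
```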

As shown in Fig. 10, HPRP generally exhibited the largest deviations from DGETRF, indicating potential numerical instability. XPRP showed smaller relative errors than HPRP and MPF for matrix sizes up to 10k, but for larger matrix sizes MPF showed superior stability. This suggests that MPF proves to be the most reliable and stable method as matrix size increases.

To provide a clearer understanding of the absolute errors associated with each method, Fig. 11 provides a heatmap of the absolute errors, with values ranging from \(0.00\) to \(5.39 \times 10^{-7}\). In the worst case, HPRP has a maximum absolute error of \(5.39 \times 10^{-7}\), XPRP of \(1.40 \times 10^{-8}\), and MPF of \(9.90 \times 10^{-10}\).

Fig. 10: Heatmap comparison of relative errors of the determinant computed by Eq. (5) for HPRP, XPRP, and MPF across different matrix sizes (1k, 4k, 10k, 20k, 30k, and 40k). The color intensity represents the magnitude of the relative errors, with MPF consistently showing the smallest relative errors, indicating its close approximation to the DGETRF method

Fig. 11: Heatmap comparison of absolute errors of the determinant computed by Eq. (5) for HPRP, XPRP, and MPF across different matrix sizes (1k, 4k, 10k, 20k, 30k, and 40k). The color intensity represents the magnitude of the absolute errors, highlighting the stability and accuracy of each method

Following this comparative analysis, we present detailed results for the MPF method compared to DGETRF for different matrix sizes, but focused on each panel. Since the determinant of a rectangular matrix is not defined, we use the term semi-determinant to denote the product of the diagonal elements of a panel, as illustrated in Fig. 12.

Fig. 12: Illustration of the semi-determinant for a rectangular matrix. The highlighted panel shows a part of the diagonal elements, and their product represents the semi-determinant

If the semi-determinant produced by MPF on each panel closely matches that of DGETRF, it suggests that the quality of the pre-calculated pivot list in MPF is sufficient. We conducted tests on matrices of sizes \(n \in \{1k, 4k, 10k, 20k, 30k, 40k\}\) with corresponding panel widths \(\omega \in \{64, 64, 128, 512, 1024, 1024\}\), respectively, where \(k = 1024\). As previously stated, the logarithmic determinant formula is employed in this investigation; the same approach is also applied to panels.

The results indicate that the MPF algorithm closely approximates the semi-determinants computed by DGETRF, with relative errors remaining below \(1 \times 10^{-3}\) in most cases for each panel. This close agreement suggests that the MPF algorithm’s pivot selection is of sufficient quality. The variations in the semi-determinants for each panel arise from different pivot selections. If a panel’s determinant is lower than what was computed by DGETRF, it is anticipated that subsequent panels will exhibit larger semi-determinants. Figure 13 presents the semi-determinant relative error for several matrix sizes ranging from 1k to 40k. To keep the figures easy to follow, we have displayed just 10 panels. Our observations from the other panels indicate a consistent behavior.

Fig. 13: Relative error of the semi-determinant for multiple panels selected from matrix sizes ranging from 1k to 40k

6.5 Error bound derivation for different methods

Having reviewed the practical results from our tests, we now turn our attention to the theoretical foundations that support these findings. To understand the accuracy and reliability of the methods discussed, it is essential to examine the theoretical bounds on the absolute error of the logarithm of the computed determinant.

6.5.1 General error bound formula

We fit a power-law model to the data using optimization to estimate parameters that minimize the error between the observed data and the model predictions. The observed data are the absolute errors of the log determinant for matrices of different dimensions. The power-law model is given by:

$$\begin{aligned} y = a \cdot x^b \end{aligned}$$

where \(a\) and \(b\) are the parameters, and \(x\) represents the matrix dimension. We aim to find parameters \(a\) and \(b\) that minimize the sum of squared errors:

$$\begin{aligned} \text {Objective Function} = \sum _{i=1}^{n} \left( y_i - a \cdot x_i^b \right) ^2 \end{aligned}$$

The MATLAB function fminsearch was used to optimize the power-law models for the methods. The fitted models are as follows:

  • XPRP:

    $$\begin{aligned} y_{\text {XPRP}} = 8.4316 \times 10^{-13} \cdot x^{2.6205} \end{aligned}$$
  • HPRP:

    $$\begin{aligned} y_{\text {HPRP}} = 5.2441 \times 10^{-9} \cdot x^{1.0521} \end{aligned}$$
  • MPF:

    $$\begin{aligned} y_{\text {MPF}} = 1.562 \times 10^{-11} \cdot x^{1.0931} \end{aligned}$$
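As a reproducible alternative to the fminsearch optimization, the power law can also be fitted by linearizing it as \(\log y = \log a + b \log x\) and applying ordinary least squares. Note that this minimizes the squared error of the logarithms rather than of the raw errors, so the resulting constants may differ slightly from those above; the (x, y) pairs in the sketch are placeholders, not our measured data.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    // Placeholder (x, y) pairs: matrix sizes and absolute log-det errors.
    std::vector<double> x = {1024, 4096, 10240, 20480, 40960};
    std::vector<double> y = {1e-9, 5e-9, 1.4e-8, 3e-8, 6e-8};

    // Ordinary least squares on (log x, log y) for the model y = a * x^b.
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    const int n = (int)x.size();
    for (int i = 0; i < n; ++i) {
        double X = std::log(x[i]), Y = std::log(y[i]);
        sx += X; sy += Y; sxx += X * X; sxy += X * Y;
    }
    double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // fitted exponent
    double a = std::exp((sy - b * sx) / n);                // fitted prefactor
    std::printf("fitted model: y = %.4e * x^%.4f\n", a, b);
    return 0;
}
```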

The theoretical formula derived from these constants is illustrated in Fig. 14.

Fig. 14: Theoretical error bounds for the logarithm of the determinant for HPRP, XPRP, and MPF

6.5.2 Analysis based on Cramer’s rule

Our testing of the MPF method on large linear systems shows that the accuracy of solutions obtained with MPF is highly comparable to those produced by the DGETRF routine. To further understand the accuracy and reliability of MPF, it is useful to relate it to Cramer’s rule, which solves a system of linear equations by using determinants of matrices. For a system \(Ax = b\), the solution for each variable \(x_i\) is given by:

$$\begin{aligned} x_i = \frac{\det (A_i(b))}{\det (A)} \end{aligned}$$

where \(A_i(b)\) is the matrix \(A\) with its \(i\)-th column replaced by \(b\). Although this formula is not used directly in most practical solution methods, it highlights an important point: the determinant of \(A\) appears in the denominator of the solution expression. This relationship suggests that errors in \(\det (A)\) can indicate errors in the solution vector \(x\). Given that we computed \(\log (\det (A))\) with some error \(\delta _{\log }\), the computed determinant can be expressed as:

$$\begin{aligned} \tilde{\det }(A) =\det (A) + \delta = \exp (\log (\det (A)) + \delta _{\log }) \end{aligned}$$

Using the property of exponentials, this can be written as:

$$\begin{aligned} \tilde{\det }(A) = \exp (\log (\det (A))) \cdot \exp (\delta _{\log }) = \det (A) \cdot \exp (\delta _{\log }) \end{aligned}$$

For small \(\delta _{\log }\), we can approximate \(\exp (\delta _{\log })\) using a first-order Taylor expansion:

$$\begin{aligned} \exp (\delta _{\log }) \approx 1 + \delta _{\log } \end{aligned}$$

Thus, the computed determinant is approximately:

$$\begin{aligned} \tilde{\det }(A) \approx \det (A) \cdot (1 + \delta _{\log }) = \det (A) + \det (A) \cdot \delta _{\log } \end{aligned}$$

Using the erroneous determinant \(\tilde{\det }(A)\), the computed solution becomes:

$$\begin{aligned} \tilde{x_i} = \frac{\det (A_i)}{\tilde{\det }(A)} \approx \frac{\det (A_i)}{\det (A) + \det (A) \cdot \delta _{\log }} \end{aligned}$$

The difference between the exact \(x_i\) and the computed \(\tilde{x_i}\) is:

$$\begin{aligned} \Delta x_i = x_i - \tilde{x_i} = \frac{\det (A_i)}{\det (A)} - \frac{\det (A_i)}{\det (A)(1 + \delta _{\log })} \end{aligned}$$

This can be factored as:

$$\begin{aligned} \Delta x_i = \frac{\det (A_i)}{\det (A)} \left( 1 - \frac{1}{1 + \delta _{\log }}\right) \end{aligned}$$

Expanding \(\frac{1}{1 + \delta _{\log }}\) using a first-order Taylor expansion:

$$\begin{aligned} \frac{1}{1 + \delta _{\log }} \approx 1 - \delta _{\log } \end{aligned}$$

This leads to:

$$\begin{aligned} \Delta x_i \approx \frac{\det (A_i)}{\det (A)} \left( 1 - (1 - \delta _{\log })\right) = \frac{\det (A_i)}{\det (A)} \cdot \delta _{\log } \end{aligned}$$

Therefore, the difference between the exact and computed solution \(x_i\) due to the error \(\delta _{\log }\) in \(\log (\det (A))\) is:

$$\begin{aligned} \Delta x_i \approx \delta _{\log } \cdot \frac{\det (A_i)}{\det (A)} \end{aligned}$$

This shows that the error in the computed \(x_i\) is proportional to both the error \(\delta _{\log }\) in the logarithm of the determinant and the exact value of \(x_i\). The perturbation in the \(i\)-th component of the solution is therefore:

$$\begin{aligned} \Delta x_i = \delta _{\log } \cdot x_i \end{aligned}$$

To find the relative error in the solution vector \(\hat{x}\), we compute the infinity norm of the perturbation vector \(\Delta \textbf{x}\):

$$\Vert \Delta \textbf{x}\Vert _\infty = \max _i |\Delta x_i| = \max _i |\delta _{\log } \cdot x_i|$$

The relative error is then:

$$\frac{\Vert \hat{x} - x\Vert _\infty }{\Vert x\Vert _\infty } \approx \frac{\Vert \Delta \textbf{x}\Vert _\infty }{\Vert x\Vert _\infty }$$

Substituting \(\Vert \Delta \textbf{x}\Vert _\infty\):

$$\frac{\Vert \hat{x} - x\Vert _\infty }{\Vert x\Vert _\infty } \approx \frac{\max _i |\delta _{\log } \cdot x_i|}{\Vert x\Vert _\infty }$$

Since \(\max _i |x_i| = \Vert x\Vert _\infty\), this simplifies to:

$$\frac{\Vert \hat{x} - x\Vert _\infty }{\Vert x\Vert _\infty } \approx \delta _{\log }$$

Thus, the relative error in the solution vector \(\hat{x}\) is approximately equal to the perturbation \(\delta _{\log }\) in \(\log (\det (A))\).
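This first-order conclusion can be checked numerically: perturbing \(\log (\det (A))\) by \(\delta _{\log }\) scales every Cramer’s-rule component by \(\exp (-\delta _{\log })\), so the resulting relative error should match \(\delta _{\log }\) to first order. A minimal sketch with a randomly generated system (sizes and \(\delta _{\log }\) are illustrative only):

```python
# Sketch: verify that a perturbation delta_log in log(det(A)) propagates to a
# relative error of roughly delta_log in the Cramer's-rule solution.
import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = np.linalg.solve(A, b)

delta_log = 1e-6                    # assumed error in log(det(A))
# With det(A) replaced by det(A) * exp(delta_log), every Cramer's-rule
# component is scaled by exp(-delta_log):
x_tilde = x * np.exp(-delta_log)

rel_err = np.linalg.norm(x_tilde - x, np.inf) / np.linalg.norm(x, np.inf)
print(rel_err, delta_log)           # the two values agree to first order
```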

6.5.3 Bounding the backward error

Given the forward error bound and based on the definition of condition number, we have:

$$\begin{aligned} \frac{\Vert \hat{x} - x\Vert _\infty }{\Vert x\Vert _\infty } \le \kappa (A) \cdot \eta , \end{aligned}$$
(6)

where \(\kappa (A)\) is the condition number of the matrix \(A\), and \(\eta\) is the backward error. Rearranging this expression to solve for \(\eta\), we obtain:

$$\begin{aligned} \eta \ge \frac{\frac{\Vert \hat{x} - x\Vert _\infty }{\Vert x\Vert _\infty }}{\kappa (A)}. \end{aligned}$$
(7)

Let \(\delta _{\log }\) denote the forward error:

$$\begin{aligned} \delta _{\log } \approx \frac{\Vert \hat{x} - x\Vert _\infty }{\Vert x\Vert _\infty }. \end{aligned}$$
(8)

Substituting \(\delta _{\log }\) into the expression for \(\eta\), we get:

$$\begin{aligned} \eta \ge \frac{\delta _{\log }}{\kappa (A)}. \end{aligned}$$
(9)

Based on the given forward errors and condition numbers of the test matrices, we assess the backward stability of three algorithms: HPRP, XPRP, and MPF. For matrices with condition numbers \(\kappa (A)\) ranging from \(10^3\) to \(10^5\), the computed minimum backward errors are as follows:

  • HPRP: \(\eta _{\text {min}} \approx 10^{-9} \text { to } 10^{-11}\)

  • XPRP: \(\eta _{\text {min}} \approx 10^{-11} \text { to } 10^{-13}\)

  • MPF: \(\eta _{\text {min}} \approx 10^{-13} \text { to } 10^{-15}\)

Given that the observed forward errors for these methods are \(\delta _{\text {HPRP}} \approx 10^{-6}\), \(\delta _{\text {XPRP}} \approx 10^{-8}\), and \(\delta _{\text {MPF}} \approx 10^{-10}\), the backward errors are significantly smaller than the forward errors. This indicates that all three methods exhibit backward stability for the specified range of condition numbers.
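As a sanity check of Eq. (9), the ranges listed above follow directly from dividing each observed forward error by the extreme condition numbers; a short sketch:

```python
# Sketch: lower bound on the backward error, eta >= delta_log / kappa(A),
# evaluated for the forward errors and condition-number range quoted above.
forward_errors = {"HPRP": 1e-6, "XPRP": 1e-8, "MPF": 1e-10}
kappas = (1e3, 1e5)

for method, delta in forward_errors.items():
    bounds = [delta / k for k in kappas]
    print(f"{method}: eta_min between {min(bounds):.0e} and {max(bounds):.0e}")
```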

6.6 Impact of block size on solution and decomposition accuracy in the MPF algorithm

In this part, we explore how block size influences the accuracy of solutions and decompositions in the MPF algorithm, as measured by Eqs. (1) and (2).

Decomposition Accuracy:

When evaluating the decomposition accuracy using Eq. (1), the ANOVA test showed a statistically significant difference in accuracy across different block sizes (F-statistic: 40.08, p-value: \(2.58 \times 10^{-16}\)). Despite the statistical significance, the practical magnitude of these variations in accuracy is relatively minor. The slight differences observed can be attributed primarily to the pivot selection process and the nonassociative nature of floating-point arithmetic, both of which can introduce minor inconsistencies as the block size changes.

Solution Accuracy:

On the other hand, Eq. (2) was used to evaluate solution accuracy across varying block sizes. Our analysis found that accuracy remains consistent regardless of changes in panel width. An ANOVA test confirmed this observation, showing no statistically significant differences in accuracy (F-statistic: 0.015, p-value: 0.9995). This stability suggests that block size does not significantly impact the computed solution, likely because pivot selection and the floating-point operations involved are robust to changes in block size.

In summary, while the block size shows minimal impact on the solution accuracy of the MPF algorithm, it does have a statistically significant, yet practically minor, effect on decomposition accuracy (Figs. 15 and 16).
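For reference, each of these comparisons reduces to a one-way ANOVA over the accuracy values grouped by block size. The following is a minimal Python sketch with hypothetical residuals (the block sizes and values are placeholders, not our measurements):

```python
# Sketch: one-way ANOVA testing whether block size affects the accuracy metric
# (e.g., ||PA - LU||_F / (n ||A||_F)). The groups below are hypothetical values,
# one list of measured residuals per block size.
from scipy.stats import f_oneway

residuals_by_block_size = {
    64:   [3.1e-16, 2.9e-16, 3.3e-16],
    128:  [3.0e-16, 3.2e-16, 3.1e-16],
    256:  [2.8e-16, 3.0e-16, 2.9e-16],
    1024: [3.4e-16, 3.5e-16, 3.3e-16],
}

f_stat, p_value = f_oneway(*residuals_by_block_size.values())
print(f"F-statistic: {f_stat:.3f}, p-value: {p_value:.3g}")
```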

Fig. 15 Impact of block size on \(\Vert PA-LU\Vert _F / n\Vert A\Vert _F\) for DGETRF (a) and MPF (b) algorithm

Fig. 16 Impact of block size on \(\Vert b-Ax\Vert _F / n\Vert A\Vert _F\Vert x\Vert _F\) for DGETRF (a) and MPF (b) algorithm

6.7 Performance analysis of HPRP, XPRP, and MPF

In this section, the performance of the proposed algorithms is examined. The analysis is organized into two main subsections: Hybrid Mode, which explores a CPU-GPU computing setup, and Native Mode, which concentrates on GPU-only computation. The performance results presented correspond to the best of at least five repetitions of each experiment. Furthermore, the input matrix is transposed in place without using an additional buffer, and the reported times include both the initial and final transpositions.
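The measurement protocol can be summarized by the sketch below; factorize() is a placeholder for the routine under test, and the transposition steps simply mark where the in-place transpositions fall inside the timed region:

```python
# Sketch of the measurement protocol: each experiment is repeated at least
# five times and the best (minimum) time is reported. factorize() is a
# placeholder for the routine under test; the transpositions are part of
# the timed region, mirroring the setup described above.
import time

def run_experiment(factorize, A, repetitions=5):
    best = float("inf")
    for _ in range(repetitions):
        t0 = time.perf_counter()
        A = A.T            # initial transposition (in place in the real code)
        factorize(A)
        A = A.T            # final transposition
        best = min(best, time.perf_counter() - t0)
    return best
```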

6.7.1 Hybrid mode

XPRP is the only one of the proposed algorithms that operates in hybrid mode. Figure 17 shows its speedup over DGETRF on a V100s. As we can see, XPRP demonstrates notable speedup trends. Specifically, for dimensions ranging from 1k to 6k, the speedup remains around 1.4x. In the range of 8k to 24k, the speedup exceeds 2x in the majority of cases. From 26k to 30k, the speedup consistently surpasses 1.6x.

Fig. 17 Speedup of XPRP over DGETRF for V100s in hybrid mode

The relative speed of the GPU and CPU, as well as the latency of the communication between them, varies from platform to platform. The experiments conducted on the V100s show that the GPU processes the TU so fast that the latency of the PF is not hidden. Figure 18 presents a clipped segment of an execution trace captured using Nsight Systems. It shows a large gap in the execution of XPRP in hybrid mode on a V100s; during this time, the GPU sits idle while waiting for the result of the PF from the CPU.

Fig. 18 XPRP in hybrid mode: GPU waiting for the CPU

Fig. 19 Performance of variants of the LU factorization executed in hybrid mode on a V100s GPU

In Fig. 19, we compare the performance achieved by various versions of an LU factorization running in hybrid CPU-GPU mode on a V100s GPU. These variants comprise the routines XSHGETRF, SGETRF, DGETRF_NOPIV, and DGETRF, sourced from MAGMA, along with our custom XPRP routine implemented in hybrid mode for V100s GPUs.

It is clear that using the XPRP algorithm in hybrid mode is faster than the hybrid variant of DGETRF. Indeed, while other algorithms may exhibit higher efficiency, they typically lack pivoting, as in DGETRF_NOPIV, or are not executed in double precision, as in XSHGETRF and SGETRF. Nevertheless, the substantial gaps observed in the execution trace indicate that the GPU remains idle for long periods in hybrid mode. Hence, our primary emphasis shifted to optimizing the native mode of the PRP and MPF algorithms, performing all computations, including panel factorization, directly on the GPU without any data movement between the CPU and GPU. In the subsequent Sect. 6.7.2, we analyze the performance of the proposed algorithms running in native mode.

6.7.2 Native mode

The main focus of this section is the performance of MPF, which showed accuracy similar to DGETRF and superior to the PRP variants, as shown in Sect. 6.3. Nevertheless, results for both variants of the PRP algorithm are included at the end of this section.

In the LU factorization of general (large) matrices, each panel can be viewed as a Tall and Skinny (TS) matrix. If HPRP provides improvements on TS matrices, then using it as a replacement for the standard PF, as proposed in MPF, may also improve the factorization of general matrices. Figure 20 shows the speedup of HPRP compared to DGETRF, operating on TS matrices with varying column sizes on the A100 GPU.

Fig. 20 Speedup of HPRP versus DGETRF in native mode operating on Tall and Skinny matrices on an A100 GPU, with subfigures representing different column sizes: (a) 1024, (b) 512, (c) 128, and (d) 64

These values correspond to the default number of columns that MAGMA uses during LU decomposition depending on the matrix dimensions. In all cases, and for the sake of a fair comparison, the execution time of HPRP includes the cost of converting the values from FP64 to FP16. However, the time to allocate the buffer that stores the FP16 copy of the panel and to create the CUDA streams is not included, because both are reused when computing the PF of different panels within the LU factorization of a large matrix; their allocation cost is therefore amortized across all executions. The time for scaling values is not considered in either case. Instead, the values stored in FP64 are assumed to be representable in FP16 up to a certain precision. It can be noted that HPRP demonstrates comparable or superior performance to DGETRF when applied to TS matrices of various widths in native mode. Thus, it is worth applying HPRP as a replacement for the standard PF within the LU decomposition of large matrices. In the rest of this section, we assess the performance of variants of MPF applied to square matrices of different dimensions in native mode. In these experiments, we consider that, if necessary, the whole matrix has been previously scaled to fit in FP16 throughout the computation of the LU.
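When such pre-scaling is necessary, one common approach (shown here only as an illustrative assumption, not necessarily the exact procedure used in our experiments) maps the largest entry onto a fraction of the FP16 range before downcasting:

```python
# Sketch: pre-scaling an FP64 matrix so that its entries fit in the FP16 range
# before downcasting. This is one common approach, used here as an assumption;
# theta < 1 leaves headroom below FP16's maximum representable value.
import numpy as np

def scale_to_fp16(A64, theta=0.1):
    fp16_max = np.finfo(np.float16).max          # 65504
    mu = theta * fp16_max / np.abs(A64).max()    # scalar scaling factor
    A16 = (mu * A64).astype(np.float16)          # scaled FP16 copy
    return A16, mu                               # mu is needed to undo the scaling later
```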

In a scenario where the CUDA kernels for PF and GEMM have equal priority, running the entire PF in parallel with GEMM is not achievable. The designers of CUBLAS, instead of emphasizing maximum occupancy, have chosen to prioritize increased shared memory and register usage, aiming to hide the time of data movement from device memory to registers. However, there are specific moments where executing a segment of the PF in parallel with GEMM becomes feasible. One such opportunity arises after the GEMM kernel is launched and before all of its resources have been allocated. Additionally, toward the end of the GEMM execution, when resources become free, there is another window of opportunity for concurrent execution of a micro-panel within the PF.

In our experiments, we observed that the occupancy of DGEMM operations was 12.5%. This indicates that although the DGEMM task was running, not all Double Fused Multiply-Add (DFMA) units on the GPU were fully utilized. Simultaneously, the shared memory and registers allocated to DGEMM were saturated. This design choice reflects CUBLAS prioritizing increased shared memory and register usage over achieving maximum thread occupancy.

As a consequence, other potential parallel execution kernels may experience delays while waiting for sufficient free shared memory and registers. The only opportunity for other kernels to execute before the system becomes saturated is during the distribution of blocks from the DGEMM kernel by the CUDA Work Distributor (CWD) or when the CWD is releasing resources. During this time, if PF and DGEMM have equal priority, both can execute in parallel, with two micro-PFs running alongside DGEMM and the remaining micro-PFs starting immediately after the DGEMM kernel finishes.

Moreover, the fusion of two routines, one for transposing the panel in FP64 and the other for converting the transposed panel to FP16, enables the generation of a transposed copy of the panel in FP16. This optimized CUDA kernel mitigates delays linked to launching separate kernels and minimizes additional data movements.

Fig. 21 Speedup comparison of MPF with respect to DGETRF on different GPU architectures: V100s, A100, and RTX3090

Figure 21 illustrates the speedup of each variant of MPF on different GPUs with respect to DGETRF running in native mode. On the V100s GPU (Fig. 21a), it is evident that MPF demonstrates enhanced performance for matrix dimensions up to 12k. Furthermore, the maximum speedup gained by MPF + Custom LA is 1.6x. The dimensions at which the A100 GPU (Fig. 21b) benefits from MPF vary; below 20k, there is a relative performance drop in MPF, which can be attributed to the A100’s large L2 cache and its exceptional performance in FP64 operations. The speedup of our native HGETRF compared to DGETRF on this GPU is moderate. However, as dimensions exceed this threshold, improvements become evident, reaching a maximum enhancement of 1.15x. Figure 21c demonstrates that the behavior of the RTX3090 closely resembles that of the V100s. The maximum speedup achieved is 1.2x, attained through MPF + Custom LA. Notably, for nearly all dimensions there is an improvement, primarily due to the high ratio of FP16 to FP64 performance on this particular GPU.

The optimal variant of MPF (MPF + Custom LA, hereafter simply MPF) has been chosen for the remaining evaluations, along with HPRP and XPRP, to assess their performance against DGETRF. The performance results are depicted in Fig. 22, highlighting the impact of different architectures. For matrices with dimensions below 8K, both HPRP and XPRP exhibit similar behavior on the V100 (Fig. 22a) and RTX3090 (Fig. 22c) GPUs. On these GPUs, both algorithms surpass DGETRF in native mode. Notably, MPF consistently delivers superior performance, particularly within the mentioned dimensions. On the A100 (Fig. 22b), both MPF and HPRP show superior performance for dimensions surpassing 30k, with MPF being faster since this platform excels in FP64 operations. Conversely, XPRP does not demonstrate improvement on this architecture. Additionally, as shown in Sect. 6.3, MPF maintains accuracy similar to DGETRF, establishing itself as the preferred method among the proposed approaches.

Fig. 22 Speedup of HPRP, XPRP, and MPF with respect to DGETRF on V100 (a), A100 (b), and RTX3090 (c) in native mode

6.8 Performance analysis of auxiliary routines

For the interested reader, this section explores the efficiency of the auxiliary routines that are essential for the fast execution of PRP and MPF. The description includes the performance evaluation of both the hybrid and native implementations.

6.8.1 HGETRF (FP16) and XGETRF (FP16-FP32)

As already mentioned, we use the LA technique for HGETRF and XGETRF. In experiments conducted on a V100s GPU, we compared the performance for matrix sizes ranging from 1K to 30K, with performance measured in terms of speedup, defined as the ratio of the execution time without LA to the execution time with LA.

The performance results for both precision schemes are summarized in Fig. 23. The RTX3090 and A100 GPUs also benefit from LA for reduced precision, and partial overlap of tasks is visible; however, we omit those results to keep the presentation focused. The pure half-precision implementation shows a peak speedup of about 1.14 around 9K, with the speedup decreasing for larger matrices. For the mixed half-single-precision implementation, the speedup is modest for smaller matrix sizes (below 5K) but peaks at approximately 1.17 around 15K, before declining slightly for larger sizes, stabilizing at around 1.06 for matrices beyond 25K.

Comparative analysis indicates that LA improves performance in \(90\%\) of cases for pure half precision and \(83.33\%\) for mixed half-single precision. These results demonstrate that LA enhances LU performance in most cases, making it a recommended optimization across a wide range of matrix sizes. Figure 24 shows the execution trace for the HGETRF and XGETRF functions. This visualization illustrates the overlap between the PF and TU operations; in particular, it can be seen that the PF partially runs in parallel with the TU during this execution. This concurrency can lead to improved performance through more efficient use of computational resources.
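The percentages of improved cases quoted above are obtained simply by counting the matrix sizes whose speedup exceeds one; a small sketch with hypothetical timings:

```python
# Sketch: fraction of matrix sizes for which lookahead (LA) helps, i.e., where
# speedup = time_without_LA / time_with_LA exceeds 1. Timings are hypothetical.
import numpy as np

time_no_la   = np.array([1.00, 2.10, 4.05, 8.20, 16.4])   # seconds, per matrix size
time_with_la = np.array([0.95, 1.95, 3.60, 8.30, 15.9])

speedup = time_no_la / time_with_la
improved = np.mean(speedup > 1.0) * 100
print(f"LA improves performance in {improved:.2f}% of cases")
```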

Fig. 23 Speedup achieved with LA for HGETRF (a) and XGETRF (b) compared to baseline performance without LA on V100s GPU

Fig. 24 Trace of execution for HGETRF (a) and XGETRF (b). We can see the partial overlap of PF and TU

Figure 25 shows the speedup achieved by our native LU implementation in pure FP16 (HGETRF) with respect to the native DGETRF on the three platforms. The RTX3090 offers limited support for FP64 calculations, with a theoretical peak performance ratio of approximately 255 when comparing FP16 to FP64. For this reason, the speedup achieved by HGETRF grows rapidly with the matrix dimension. Due to the enhanced resources of the A100 (a 6.6 times larger L2 cache, 1.37 times higher memory bandwidth, and the availability of Tensor Cores for FP64 operations), FP64 performance is boosted, and for that reason HGETRF does not achieve the same speedup as observed on the V100s.

Fig. 25 (a) Speedup of native HGETRF versus native DGETRF on V100s, A100, and RTX3090. (b) Zoomed version dedicated to A100 and V100s

Figure 26 illustrates the performance comparison between the hybrid (MAGMA XSHGETRF) and native (XGETRF) implementations of FP16-FP32 LU factorization on the V100s (a) and the A100 (b). The slow communication between the GPU and the CPU notably impacts the A100’s performance in hybrid mode. Across both scenarios, the native implementation consistently achieves superior performance and delivers a speedup of about 2x on a V100s and approximately 14x on an A100 GPU.

Fig. 26 Speedup of FP16-FP32 LU factorization: hybrid (MAGMA XSHGETRF) versus native (XGETRF) on V100 (a) and A100 (b) GPUs

6.8.2 NPV (FP64)

Let us start by discussing the performance of the different implementations of the PF presented in Sect. 5.1.2 on a V100s GPU. As previously mentioned, the width of the panels inside the MAGMA library ranges from 32 to 1024, with the preferred panel size selected based on the matrix size. We investigated whether using panel widths different from the default settings in MAGMA would result in better performance. Our findings indicated only limited improvements in rare cases. As this is not the primary focus of this paper, we will not explore the impact of panel width further.

According to our experiments, shown in Fig. 27, opting for the block implementation over the recursive one proves to be essential for the largest panel widths. Furthermore, the shared memory implementation, with the micro-panel width set to 64 columns, outperforms warp shuffling in most scenarios.

Fig. 27 NPV: Block/Recursive and Shared Memory/Warp Shuffle on V100s

Our evaluation led us to select the block version that leverages shared memory with a micro-panel width of 64. Using this configuration, in the remainder of this section we present results for the native NPV on the three GPUs used in this study.

Figure 28 shows the performance of NPV on a V100s (top left), an A100 (top right), and an RTX3090 (bottom). The native code always performs better on the V100s and A100, with a noticeable speedup of up to 2x on the V100s and 2.75x on the A100. The better support for FP64 computation on the A100, greatly enhanced by its FP64-enabled Tensor Cores, makes a substantial difference.

Fig. 28 Speed comparison: NPV (hybrid vs. native) on GPUs: V100 (a), A100 (b), RTX3090 (c)

In contrast, the bottom part of Fig. 28 shows that on the RTX3090, hybrid GPU-CPU operation can still be beneficial. This happens when the work associated with the TU performed on the GPU takes enough time to hide the latency of performing the PF on the CPU, a consequence of the reduced resources available for FP64 computation on this GPU. Consequently, on the RTX3090 the performance of the hybrid NPV code is slightly better for matrices of dimension 7K and above.

Figure 29 details the speedup obtained by NPV over DGETRF on each of the three platforms after applying all the optimizations described above. As expected, the characteristics of the GPU affect the relative performance of the different algorithms. We can observe that smaller dimensions benefit most from performing an LU factorization without pivoting. This makes sense, since the PF accounts for a larger fraction of the total execution time at small dimensions, whereas for large dimensions the TU completely dominates. On the RTX3090, both NPV and DGETRF quickly saturate the FP64 resources due to the limited support for FP64 computation on this platform.

Fig. 29 Speedup of NPV w.r.t. DGETRF on the 3 GPUs in native mode

7 Conclusions and future work

In this research, our primary objective was to enhance the efficiency of double-precision LU decomposition by integrating half precision to reduce data movement in FP64 operations. We introduced two algorithms, PRP and MPF, with the first developed in two variants, namely HPRP and XPRP. These variants identify the pivot list for the LU decomposition in either pure FP16 or mixed FP16-FP32 precision. Then, by rearranging the rows of the original matrix according to the determined pivot list, the LU decomposition is performed on the original matrix in FP64 without pivoting. The MPF algorithm employs a similar concept, but applies HPRP exclusively to each panel factorization during the LU decomposition.

In this research, we explored both native (GPU-only) and hybrid (CPU-GPU) configurations. An important lesson concerned the sensitivity of performance to the hardware configuration and to the data movement between the GPU and CPU, which, contrary to expectations, does not always allow the hybrid version to perform better. To address this challenge, we developed and optimized several native implementations of LU factorization with partial pivoting, including mixed FP16-FP32 versions and a pure FP16 version; notably, MAGMA lacks both GETRF and TRSM routines in half precision. Furthermore, a native version of LU without pivoting in FP64 was also implemented, which plays a crucial role in the proposed algorithms.

Our findings indicate that the performance of HPRP, XPRP, and MPF varies depending on the matrix dimension and platform. Across all platforms, MPF consistently performs better than the other LU factorization methods while providing an accuracy similar to that of DGETRF.

In future research, an avenue to explore is integrating the MPF method into the broader framework of matrices that are globally sparse but locally dense. As shown, the HPRP method demonstrates improved speed when applied to matrices with a limited number of columns. The multifrontal scheme for sparse matrices, widely used in finite element analysis, organizes the computational steps involved in factoring sparse matrices around partial factorizations of small, dense submatrices. In light of this, a compelling idea is to apply the MPF method to each of these dense submatrices, aiming to take advantage of its potential benefits. This approach is supported by the multifrontal method described by Davis and Duff (1997) [36], which outlines efficient strategies for matrix factorization.