Floating-point arithmetic

Sylvie Boldo; Claude-Pierre Jeannerod; Guillaume Melquiond; Jean-Michel Muller

doi:10.1017/S0962492922000101


Part of: Computer aspects of numerical algorithms; Error analysis and interval analysis; Computer system organization

Published online by Cambridge University Press: 11 May 2023


Sylvie Boldo: Affiliation: Université Paris Saclay, CNRS, ENS Paris Saclay, Inria, LMF, 91190 Gif-sur-Yvette, France
Claude-Pierre Jeannerod: Affiliation: Inria, ENS de Lyon, LIP, 69364 Lyon, France
Guillaume Melquiond: Affiliation: Université Paris Saclay, CNRS, ENS Paris Saclay, Inria, LMF, 91190 Gif-sur-Yvette, France
Jean-Michel Muller: Affiliation: CNRS, ENS de Lyon, LIP, 69364 Lyon, France


Abstract


Floating-point numbers have an intuitive meaning when it comes to physics-based numerical computations, and they have thus become the most common way of approximating real numbers in computers. The IEEE-754 Standard has played a large part in making floating-point arithmetic ubiquitous today, by specifying its semantics in a strict yet useful way as early as 1985. In particular, floating-point operations should be performed as if their results were first computed with infinite precision and then rounded to the target format. A consequence is that floating-point arithmetic satisfies the ‘standard model’ that is often used for analysing the accuracy of floating-point algorithms. But that is only scratching the surface, and floating-point arithmetic offers much more.
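As an illustration of the rounding rule and of the ‘standard model’, the following minimal sketch checks that a computed sum and product satisfy fl(a op b) = (a op b)(1 + δ) with |δ| ≤ u. It assumes that Python floats are IEEE-754 binary64 numbers rounded to nearest, so that u = 2^-53, and that no underflow or overflow occurs; the helper name relative_error is purely illustrative and does not come from the article.

```python
from fractions import Fraction

# Unit roundoff of binary64 with round-to-nearest (assumed format).
U = Fraction(1, 2**53)

def relative_error(computed, exact):
    """Return |computed - exact| / |exact| as an exact rational (exact != 0)."""
    return abs(Fraction(computed) - exact) / abs(exact)

a, b = 0.1, 0.3

# Fraction(float) converts exactly, so these are the infinitely precise results.
exact_sum = Fraction(a) + Fraction(b)
exact_prod = Fraction(a) * Fraction(b)

# The hardware results are the exact results rounded to binary64.
delta_sum = relative_error(a + b, exact_sum)
delta_prod = relative_error(a * b, exact_prod)

print("addition within standard model:      ", delta_sum <= U)
print("multiplication within standard model:", delta_prod <= U)
```

Exact rational arithmetic is used here only to play the role of the "infinite precision" intermediate result; it is not needed in practice.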

In this survey we recall the history of floating-point arithmetic as well as its specification mandated by the IEEE-754 Standard. We also recall what properties it entails and what every programmer should know when designing a floating-point algorithm. We provide various basic blocks that can be implemented with floating-point arithmetic. In particular, one can actually compute the rounding error caused by some floating-point operations, which paves the way to designing more accurate algorithms. More generally, properties of floating-point arithmetic make it possible to extend the accuracy of computations beyond working precision.
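The claim that the rounding error of some operations can itself be computed may be sketched with the classical 2Sum algorithm, commonly attributed to Møller and Knuth. The sketch below is not taken from the article; it assumes binary64 arithmetic with round-to-nearest and no overflow, and the function and variable names are illustrative.

```python
from fractions import Fraction

def two_sum(a, b):
    """Return (s, e) where s = fl(a + b) and, barring overflow, s + e == a + b exactly."""
    s = a + b
    a_virtual = s - b          # the part of s that came from a
    b_virtual = s - a_virtual  # the part of s that came from b
    e = (a - a_virtual) + (b - b_virtual)  # rounding error of the addition
    return s, e

# 2**-60 is too small to affect 1.0 in binary64, so it is entirely recovered in e.
s, e = two_sum(1.0, 2.0**-60)
print(s == 1.0, e == 2.0**-60)

# Exact check with rationals: the pair (s2, e2) represents 0.1 + 0.2 without error.
s2, e2 = two_sum(0.1, 0.2)
print(Fraction(s2) + Fraction(e2) == Fraction(0.1) + Fraction(0.2))
```

Keeping the error term e alongside the rounded sum s is what allows compensated algorithms to reach accuracy beyond the working precision.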

MSC classification

Primary: 65Y04: Algorithms for computer arithmetic, etc.

Secondary: 65G50: Roundoff error; 68M07: Mathematical problems of computer architecture

Type: Research Article
Information: Acta Numerica, Volume 32, May 2023, pp. 203–290

DOI: https://doi.org/10.1017/S0962492922000101
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright: © The Author(s), 2023. Published by Cambridge University Press

References

Agrawal, A., Mueller, S. M., Fleischer, B. M., Sun, X., Wang, N., Choi, J. and Gopalakrishnan, K. (2019), DLFloat: A 16-b floating point format designed for deep learning training and inference, in 26th IEEE Symposium on Computer Arithmetic, IEEE, pp. 92–95.Google Scholar

Anderson, C. S., Zhang, J. and Cornea, M. (2018), Enhanced vector math support on the Intel®AVX-512 architecture, in 25th IEEE Symposium on Computer Arithmetic, pp. 120–124.Google Scholar

Babuška, I. (1969), Numerical stability in mathematical analysis, in Proceedings of the 1968 IFIP Congress , Vol. 1, pp. 11–23.Google Scholar

Barnes, R. C. M., Cooke-Yarborough, E. H. and Thomas, D. G. A. (1951), An electronic digital computor using cold cathode counting tubes for storage (Part 1), Electron. Engng 23, 286–291.Google Scholar

Bartels, T., Fisikopoulos, V. and Weiser, M. (2022), Fast floating-point filters for robust predicates. Available at arXiv:2208.00497.Google Scholar

Baudin, M. and Smith, R. L. (2012), A robust complex division in Scilab. Available at arXiv:1210.4539.Google Scholar

Beebe, N. H. F. (2017), The Mathematical-Function Computation Handbook , Springer.CrossRef Google Scholar

Bertaccini, L., Paulin, G., Fischer, T., Mach, S. and Benini, L. (2022), MiniFloat-NN and ExSdotp: An ISA extension and a modular open hardware unit for low-precision training on RISC-V cores, in 29th IEEE Symposium on Computer Arithmetic.CrossRef Google Scholar

Blanchard, P., Higham, N. J. and Mary, T. (2020), A class of fast and accurate summation algorithms, SIAM J. Sci. Comput. 42, A1541–A1557.CrossRef Google Scholar

Bohlender, G., Walter, W., Kornerup, P. and Matula, D. (1991), Semantics for exact floating point operations, in 10th IEEE Symposium on Computer Arithmetic, pp. 22–26.Google Scholar

Boldo, S. (2006), Pitfalls of a full floating-point proof: Example on the formal proof of the Veltkamp/Dekker algorithms, in 3rd International Joint Conference on Automated Reasoning (Furbach, U. and Shankar, N., eds), Vol. 4130 of Lecture Notes in Computer Science, Springer, pp. 52–66.Google Scholar

Boldo, S. (2009), Kahan’s algorithm for a correct discriminant computation at last formally proven, IEEE Trans. Comput. 58, 220–225.CrossRef Google Scholar

Boldo, S. and Daumas, M. (2003), Representable correcting terms for possibly underflowing floating point operations, in 16th IEEE Symposium on Computer Arithmetic (Bajard, J.-C. and Schulte, M., eds), pp. 79–86.Google Scholar

Boldo, S. and Melquiond, G. (2008), Emulation of a FMA and correctly rounded sums: Proved algorithms using rounding to odd, IEEE Trans. Comput. 57, 462–471.CrossRef Google Scholar

Boldo, S. and Melquiond, G. (2017), Computer Arithmetic and Formal Proofs , ISTE Press / Elsevier.Google Scholar

Boldo, S. and Muller, J.-M. (2005), Some functions computable with a fused-mac, in 17th IEEE Symposium on Computer Arithmetic, pp. 52–58.Google Scholar

Boldo, S. and Muller, J.-M. (2011), Exact and approximated error of the FMA, IEEE Trans. Comput. 60, 157–164.CrossRef Google Scholar

Boldo, S., Graillat, S. and Muller, J.-M. (2017), On the robustness of the 2Sum and Fast2Sum algorithms, ACM Trans. Math. Softw . 44, 4:1–4:14.CrossRef Google Scholar

Boldo, S., Lauter, C. and Muller, J.-M. (2021), Emulating round-to-nearest ties-to-zero ‘augmented’ floating-point operations using round-to-nearest ties-to-even arithmetic, IEEE Trans. Comput. 70, 1046–1058.CrossRef Google Scholar

Borges, C. F. (2021), Algorithm 1014: An improved algorithm for Hypot $\left(x,y\right)$

, ACM Trans. Math. Softw. 47, 1–12.CrossRef Google Scholar

Borges, C. F., Jeannerod, C.-P. and Muller, J.-M. (2022), High-level algorithms for correctly-rounded reciprocal square roots, in 29th IEEE Symposium on Computer Arithmetic, pp. 18–25.Google Scholar

Brent, R. P. (1973), On the precision attainable with various floating-point number systems, IEEE Trans. Comput. C-22, 601–607.Google Scholar

Brent, R. P. (1978), Algorithm 524: MP, a Fortran multiple-precision arithmetic package [A1], ACM Trans. Math. Softw. 4, 71–81.CrossRef Google Scholar

Brent, R., Percival, C. and Zimmermann, P. (2007), Error bounds on complex floating-point multiplication, Math. Comp. 76, 1469–1481.CrossRef Google Scholar

Brisebarre, N. and Chevillard, S. (2007), Efficient polynomial L-approximations, in 18th IEEE Symposium on Computer Arithmetic, pp. 169–176.Google Scholar

Brisebarre, N. and Muller, J.-M. (2008), Correctly rounded multiplication by arbitrary precision constants, IEEE Trans. Comput. 57, 165–174.CrossRef Google Scholar

Brisebarre, N., Hanrot, G. and Robert, O. (2017), Exponential sums and correctly-rounded functions, IEEE Trans. Comput. 66, 2044–2057.CrossRef Google Scholar

Brisebarre, N., Joldeş, M., Muller, J.-M., Nanes, A.-M. and Picot, J. (2020), Error analysis of some operations involved in the Cooley–Tukey fast Fourier transform, ACM Trans. Math. Softw . 46, 11:1–11:27.CrossRef Google Scholar

Brunie, N., de Dinechin, F., Kupriianova, O. and Lauter, C. (2015), Code generators for mathematical functions, in 22nd IEEE Symposium on Computer Arithmetic, pp. 66–73.Google Scholar

Cameron, T. R. and Graillat, S. (2022), On a compensated Ehrlich–Aberth method for the accurate computation of all polynomial roots, Electron . Trans. Numer. Anal. 55, 401–423.Google Scholar

Castaldo, A. M., Whaley, R. C. and Chronopoulos, A. T. (2009), Reducing floating point error in dot product using the superblock family of algorithms, SIAM J. Sci. Comput. 31, 1156–1174.CrossRef Google Scholar

Ceruzzi, P. E. (1981), The early computers of Konrad Zuse, 1935 to 1945, Ann. Hist. Comput. 3, 241–262.CrossRef Google Scholar

Champagne, W. P. (1964), On finding roots of polynomials by hook or by crook. MSc thesis, University of Texas, Austin, TX.Google Scholar

Chevillard, S., Harrison, J., Joldeş, M. and Lauter, C. (2011), Efficient and accurate computation of upper bounds of approximation errors, Theoret . Comput. Sci. 412, 1523–1543.Google Scholar

Chevillard, S., Joldeş, M. and Lauter, C. (2010), Sollya: An environment for the development of numerical codes, in International Conference on Mathematical Software (Fukuda, K. et al., eds), Vol. 6327 of Lecture Notes in Computer Science, Springer, pp. 28–31.Google Scholar

Chung, E., Fowers, J., Ovtcharov, K., Papamichael, M., Caulfield, A., Massengill, T., Liu, M., Lo, D., Alkalay, S., Haselman, M., Abeydeera, M., Adams, L., Angepat, H., Boehn, C., Chiou, D., Firestein, O., Forin, A., Gatlin, K. S., Ghandi, M., Heil, S., Holohan, K., Husseini, A. El, Juhasz, T., Kagi, K., Kovvuri, R. K., Lanka, S., van Megen, F., Mukhortov, D., Patel, P., Perez, B., Rapsang, A., Reinhardt, S., Rouhani, B., Sapek, A., Seera, R., Shekar, S., Sridharan, B., Weisz, G., Woods, L., Xiao, P. Yi, Zhang, D., Zhao, R. and Burger, D. (2018), Serving DNNs in real time at datacenter scale with project brainwave, IEEE Micro 38, 8–20.CrossRef Google Scholar

Cocke, J. and Markstein, V. (1990), The evolution of RISC technology at IBM, IBM J. Res. Dev. 34, 4–11.CrossRef Google Scholar

Cococcioni, M., Rossi, F., Ruffaldi, E. and Saponara, S. (2022), Small reals representations for deep learning at the edge: A comparison, in Next Generation Arithmetic (Gustafson, J. and Dimitrov, V., eds), Springer, pp. 117–133.Google Scholar

Cody, W. J. and Waite, W. (1980), Software Manual for the Elementary Functions , Prentice-Hall.Google Scholar

Collange, C., Defour, D., Graillat, S. and Iakymchuk, R. (2015), Numerical reproducibility for the parallel reduction on multi- and many-core architectures, Parallel Comput. 49, 83–97.CrossRef Google Scholar

Connolly, M. P. and Higham, N. J. (2022), Probabilistic rounding error analysis of Householder QR factorization. MIMS EPrint 2022.5, Manchester Institute for Mathematical Sciences, The University of Manchester, UK. Available at http://eprints.maths. manchester.ac.uk/2865/.Google Scholar

Connolly, M. P., Higham, N. J. and Mary, T. (2021), Stochastic rounding and its probabilistic backward error analysis, SIAM J. Sci. Comput. 43, A566–A585.CrossRef Google Scholar

Connolly, M. P., Higham, N. J. and Pranesh, S. (2022), Randomized low rank matrix approximation: Rounding error analysis and a mixed precision algorithm. MIMS EPrint 2022.10, Manchester Institute for Mathematical Sciences, The University of Manchester, UK. Available at http://eprints.maths.manchester.ac.uk/2863/.Google Scholar

Cornea-Hasegan, M. A., Golliver, R. A. and Markstein, P. (1999), Correctness proofs outline for Newton–Raphson based floating-point divide and square root algorithms, in 14th IEEE Symposium on Computer Arithmetic, pp. 96–105.Google Scholar

Cornea, M., Harrison, J. and Tang, P. T. P. (2002), Scientific Computing on Itanium-based Systems, Intel Press.Google Scholar

Croci, M., Fasi, M., Higham, N. J., Mary, T. and Mikaitis, M. (2022), Stochastic rounding: implementation, error analysis and applications, Royal Soc . Open Sci. 9, 1–25.Google Scholar

Darcy, J. (2017), Restore always-strict floating-point semantics. Technical report JEP 306.Google Scholar

Daumas, M. (1999), Multiplications of floating point expansions, in 14th IEEE Symposium on Computer Arithmetic, pp. 250–257.Google Scholar

Daumas, M., Rideau, L. and Théry, L. (2001), A generic library of floating-point numbers and its application to exact computing, in 14th International Conference on Theorem Proving in Higher Order Logics (Boulton, R. J. and Jackson, P. B., eds), Vol. 2152 of Lecture Notes in Computer Science, Springer, pp. 169–184.Google Scholar

de Dinechin, F., Forget, L., Muller, J.-M. and Uguen, Y. (2019), Posits: The good, the bad and the ugly, in Conference on Next-Generation Arithmetic, ACM Press, pp. 1–10.Google Scholar

de Dinechin, F., Lauter, C. and Melquiond, G. (2011), Certifying the floating-point implementation of an elementary function using Gappa, IEEE Trans. Comput. 60, 242–253.CrossRef Google Scholar

Dekker, T. J. (1971), A floating-point technique for extending the available precision, Numer . Math. 18, 224–242.Google Scholar

Demmel, J. (1984), Underflow and the reliability of numerical software, SIAM J. Sci. Statist. Comput. 5, 887–919.CrossRef Google Scholar

Demmel, J., Ahrens, P. and Nguyen, H. D. (2016), Efficient reproducible floating point summation and BLAS. Technical report UCB/EECS-2016-121, EECS Department, University of California, Berkeley.Google Scholar

Demmel, J. and Hida, Y. (2004), Fast and accurate floating point summation with application to computational geometry, Numer . Algorithms 37, 101–112.CrossRef Google Scholar

Demmel, J. and Nguyen, H. D. (2015), Parallel reproducible summation, IEEE Trans. Comput. 64, 2060–2070.CrossRef Google Scholar

Demmel, J. and Riedy, J. (2021), A new IEEE 754 standard for floating-point arithmetic in an ever-changing world, SIAM News 54, 9.Google Scholar

Demmel, J., Dongarra, J., Gates, M., Henry, G., Langou, J., Li, X., Luszczek, P., Pereira, W., Riedy, J. and Rubio-González, C. (2022), Proposed consistent exception handling for the BLAS and LAPACK. Available at arXiv:2207.09281.CrossRef Google Scholar

El Arar, E.-M., Sohier, D., de Oliveira Castro, P. and Petit, E. (2022), The positive effects of stochastic rounding in numerical algorithms, in 29th IEEE Symposium on Computer Arithmetic, pp. 58–65.Google Scholar

Fabiano, N., Muller, J.-M. and Picot, J. (2019), Algorithms for triple-word arithmetic, IEEE Trans. Comput. 68, 1573–1583.CrossRef Google Scholar

Fasi, M. and Mikaitis, M. (2020), CPFloat: A C library for simulating low-precision arithmetic. MIMS EPrint 2020.22, Manchester Institute for Mathematical Sciences, The University of Manchester, UK. Available at http://eprints.maths.manchester.ac.uk/2873/.Google Scholar

Fasi, M., Higham, N. J., Mikaitis, M. and Pranesh, S. (2021), Numerical behavior of NVIDIA tensor cores, PeerJ Comput. Sci. 7, e330.CrossRef Google Scholar PubMed

Févotte, F. and Lathuilière, B. (2016), VERROU: Assessing floating-point accuracy without recompiling. Available at https://hal.archives-ouvertes.fr/hal-01383417.Google Scholar

Figueroa, S. A. (1995), When is double rounding innocuous?, ACM SIGNUM Newsletter 30, 21–26.CrossRef Google Scholar

Flegg, G., Hay, C. and Moss, B. (1985), Nicolas Chuquet, Renaissance Mathematician: A Study With Extensive Translation of Chuquet’s Mathematical Manuscript Completed in 1484 , Springer.Google Scholar

Forsythe, G. E. (1959), Reprint of a note on rounding-off errors, SIAM Review 1, 66–67.CrossRef Google Scholar

Fortune, S. and Van Wyk, C. J. (1993), Efficient exact arithmetic for computational geometry, in 9th Annual Symposium on Computational Geometry, ACM, pp. 163–172.Google Scholar

Fousse, L., Hanrot, G., Lefèvre, V., Pélissier, P. and Zimmermann, P. (2007), MPFR: A multiple-precision binary floating-point library with correct rounding, ACM Trans. Math. Softw. 33, 13–es.CrossRef Google Scholar

Friedland, P. (1967), Algorithm 312: Absolute value and square root of a complex number, Commun . Assoc. Comput. Mach. 10, 665.Google Scholar

Gill, S. (1951), A process for the step-by-step integration of differential equations in an automatic digital computing machine, Math. Proc. Cambridge Philos. Soc. 47, 96–108.CrossRef Google Scholar

Goldberg, D. (1991), What every computer scientist should know about floating-point arithmetic, ACM Computing Surveys 23, 5–48. Edited reprint available at https://docs.oracle.com/cd/E19059-01/fortec6u2/806-7996/806-7996.pdf from Sun’s Numerical Computation Guide; it contains an addendum Differences Among IEEE 754 Implementations, also available at http://www.validlab.com/goldberg/addendum.html.CrossRef Google Scholar

Goldberg, I. B. (1967), 27 bits are not enough for 8-digit accuracy, Commun . Assoc. Comput. Mach. 10, 105–106.Google Scholar

Goualard, F. (2014), How do you compute the midpoint of an interval?, ACM Trans. Math. Softw. 40, 11:1–11:25.CrossRef Google Scholar

Goualard, F. (2022), Drawing random floating-point numbers from an interval, ACM Trans. Model. Comput. Simul. 32, 16:1–16:24.CrossRef Google Scholar

Graillat, S. and Ménissier-Morain, V. (2007), Error-free transformations in real and complex floating-point arithmetic, in 2007 International Symposium on Nonlinear Theory and its Applications, pp. 341–344.Google Scholar

Graillat, S. and Ménissier-Morain, V. (2008), Compensated Horner scheme in complex floating point arithmetic, in 8th Conference on Real Numbers and Computer, pp. 133–146.Google Scholar

Graillat, S. and Ménissier-Morain, V. (2012), Accurate summation, dot product and polynomial evaluation in complex floating-point arithmetic, Inform. Comput. 216, 57–71.CrossRef Google Scholar

Graillat, S., Lefèvre, V. and Muller, J.-M. (2020), Alternative split functions and Dekker’s product, in 27th IEEE Symposium on Computer Arithmetic, pp. 41–47.Google Scholar

Gregory, R. T. and Raney, J. L. (1964), Floating-point arithmetic with 84-bit numbers, Commun . Assoc. Comput. Mach. 7, 10–13.Google Scholar

Gustafson, J. L. (2015), The End of Error: Unum Computing , Chapman & Hall / CRC.Google Scholar

Hallman, E. and Ipsen, I. C. F. (2022), Precision-aware deterministic and probabilistic error bounds for floating point summation. Available at arXiv:2203.15928.Google Scholar

Harrison, J. (1999), A machine-checked theory of floating point arithmetic, in 12th International Conference in Theorem Proving in Higher Order Logics (Bertot, Y. et al., eds), Vol. 1690 of Lecture Notes in Computer Science, Springer, pp. 113–130.Google Scholar

Hauser, J. R. (1996), Handling floating-point exceptions in numeric programs, ACM Trans. Program. Lang. Syst. 18, 139–174.CrossRef Google Scholar

He, Y. and Ding, C. H. Q. (2000), Using accurate arithmetics to improve numerical reproducibility and stability in parallel applications, in 14th International Conference on Supercomputing, ACM, pp. 225–234.Google Scholar

Hennessy, J. L. and Patterson, D. A. (2012), Computer Architecture: A Quantitative Approach , fifth edition, Morgan Kaufman.Google Scholar

Henry, G., Tang, P. T. P. and Heinecke, A. (2019), Leveraging the bfloat16 artificial intelligence datatype for higher-precision computations, in 26th IEEE Symposium on Computer Arithmetic, pp. 69–76.Google Scholar

Hida, Y., Li, X. S. and Bailey, D. H. (2001), Algorithms for quad-double precision floating-point arithmetic, in 15th IEEE Symposium on Computer Arithmetic, pp. 155–162.Google Scholar

Higham, N. J. (1993), The accuracy of floating point summation, SIAM J. Sci. Comput. 14, 783–799.CrossRef Google Scholar

Higham, N. J. (2002), Accuracy and Stability of Numerical Algorithms, second edition, SIAM.CrossRef Google Scholar

Higham, N. J. (2021a), The mathematics of floating-point arithmetic, LMS Newsletter 493, 35–41.Google Scholar

Higham, N. J. (2021b), Numerical stability of algorithms at extreme scale and low precisions. MIMS EPrint 2021.14, Manchester Institute for Mathematical Sciences, The University of Manchester, UK. Available at http://eprints.maths.manchester.ac.uk/id/ eprint/2833.Google Scholar

Higham, N. J. and Mary, T. (2019), A new approach to probabilistic rounding error analysis, SIAM J. Sci. Comput. 41, A2815–A2835.CrossRef Google Scholar

Higham, N. J. and Mary, T. (2020), Sharper probabilistic backward error analysis for basic linear algebra kernels with random data, SIAM J. Sci. Comput. 42, A3427–A3446.CrossRef Google Scholar

Higham, N. J. and Mary, T. (2022), Mixed precision algorithms in numerical linear algebra, Acta Numer. 31, 347–414.CrossRef Google Scholar

Higham, N. J. and Pranesh, S. (2019), Simulating low precision floating-point arithmetic, SIAM J. Sci. Comput. 41, C585–C602.CrossRef Google Scholar

Hirshfeld, A. (2009), Eureka Man: The Life and Legacy of Archimedes, Walker & Company.Google Scholar

Hull, T. E., Fairgrieve, T. F. and Tang, P. T. P. (1994), Implementing complex elementary functions using exception handling, ACM Trans. Math. Softw. 20, 215–244.CrossRef Google Scholar

IEEE (2015), IEEE Standard for Interval Arithmetic (IEEE Std 1788-2015), IEEE.Google Scholar

IEEE (2019), IEEE Standard for Floating-Point Arithmetic (IEEE Std 754-2019), IEEE.Google Scholar

Iffrah, G. (1999), The Universal History of Numbers: From Prehistory to the Invention of the Computer, Wiley.Google Scholar

Ikebe, Y. (1965), Note on triple-precision floating-point arithmetic with 132-bit numbers, Commun . Assoc. Comput. Mach. 8, 175–177.Google Scholar

Innocente, V. and Zimmermann, P. (2022), Accuracy of mathematical functions in single, double, extended double and quadruple precision. Available at hal-03141101.Google Scholar

Intel (2018), BFLOAT16: Hardware numerics definition. White Paper, available at https://www.intel.com/content/dam/develop/external/us/en/documents/bf16-hardware-numerics-definition-white-paper.pdf.Google Scholar

International Organization for Standardization (2010), Programming Languages – Fortran – Part 1: Base language, International Standard ISO/IEC 1539-1:2010.Google Scholar

International Organization for Standardization, Geneva, Switzerland (2011), Programming Languages – C, International Standard ISO/IEC 9899:2011.Google Scholar

Ipsen, I. C. F. and Zhou, H. (2020), Probabilistic error analysis for inner products, SIAM J. Matrix Anal. Appl. 41, 1726–1741.CrossRef Google Scholar PubMed

ISO/IEC (2022), C programming language – N3054, working draft of the standard (September 2022). https://en.wikipedia.org/wiki/C2.Google Scholar

Jeannerod, C.-P. (2016), A radix-independent error analysis of the Cornea–Harrison–Tang method, ACM Trans. Math. Softw. 42, 19:1–19:20.CrossRef Google Scholar

Jeannerod, C.-P. (2020), The relative accuracy of (x+y)*(x-y), J. Comput. Appl. Math. 369, 112613.CrossRef Google Scholar

Jeannerod, C.-P. and Muller, J.-M. (2017), On the relative error of computing complex square roots in floating-point arithmetic, in 51st Asilomar Conference on Signals, Systems, and Computers, IEEE, pp. 737–740.Google Scholar

Jeannerod, C.-P. and Rump, S. M. (2018), On relative errors of floating-point operations: optimal bounds and applications, Math. Comp. 87, 803–819.CrossRef Google Scholar

Jeannerod, C.-P., Kornerup, P., Louvet, N. and Muller, J.-M. (2017a), Error bounds on complex floating-point multiplication with an FMA, Math. Comp. 86, 881–898.CrossRef Google Scholar

Jeannerod, C.-P., Louvet, N. and Muller, J.-M. (2013a), Further analysis of Kahan’s algorithm for the accurate computation of $2\times 2$

determinants, Math. Comp. 82, 2245–2264.CrossRef Google Scholar

Jeannerod, C.-P., Louvet, N. and Muller, J.-M. (2013b), On the componentwise accuracy of complex floating-point division with an FMA, in 21st IEEE Symposium on Computer Arithmetic (A. Nannarelli et al., eds), pp. 83–90.CrossRef Google Scholar

Jeannerod, C.-P., Louvet, N., Muller, J.-M. and Plet, A. (2016), Sharp error bounds for complex floating-point inversion, Numer . Algorithms 73, 735–760.CrossRef Google Scholar

Jeannerod, C.-P., Monat, C. and Thévenoux, L. (2017b), More accurate complex multiplication for embedded processors, in 12th IEEE International Symposium on Industrial Embedded Systems, pp. 1–4.CrossRef Google Scholar

Jeannerod, C.-P., Muller, J.-M. and Zimmermann, P. (2018), On various ways to split a floating-point number, in 25th IEEE Symposium on Computer Arithmetic, IEEE, pp. 53–60.Google Scholar

Jiang, H., Graillat, S., Barrio, R. and Yang, C. (2016), Accurate, validated and fast evaluation of elementary symmetric functions and its application, Appl. Math. Comput. 273, 1160–1178.Google Scholar

Johansson, F. (2013), Arb: A C library for ball arithmetic, ACM Commun . Comput. Algebra 47, 166–169.Google Scholar

Joldeş, M., Muller, J.-M. and Popescu, V. (2017), Tight and rigorous error bounds for basic building blocks of double-word arithmetic, ACM Trans. Math. Softw. 44, 1–27.CrossRef Google Scholar

Joldeş, M., Muller, J.-M., Popescu, V. and Tucker, W. (2016), CAMPARY: Cuda multiple precision arithmetic library and applications, in 5th International Congress on Mathematical Software (Greuel, G. M. et al., eds), Vol. 9725 of Lecture Notes in Computer Science, Springer, pp. 232–240.Google Scholar

Kahan, W. (1965), Pracniques: Further remarks on reducing truncation errors, Commun . Assoc. Comput. Mach. 8, 40.Google Scholar

Kahan, W. (1981), Why do we need a floating-point arithmetic standard? Technical report, Computer Science, UC Berkeley. Available at http://www.cs.berkeley.edu/~wkahan/ieee754status/why-ieee.pdf.Google Scholar

Kahan, W. (1987), Branch cuts for complex elementary functions or much ado about nothing’s sign bit, in The State of the Art in Numerical Analysis (Iserles, A. and Powell, M. J. D., eds), Oxford University Press, pp. 165–211.Google Scholar

Kahan, W. (1997), Lecture notes on the status of IEEE standard 754 for binary floating-point arithmetic. Available at http://www.cs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF.Google Scholar

Kahan, W. (1998), Matlab’s loss is nobody’s gain. Available at https://people.eecs.berkeley.edu/~wkahan/MxMulEps.pdf.Google Scholar

Kahan, W. (2004a), A logarithm too clever by half. Available at http://http.cs.berkeley.edu/~wkahan/LOG10HAF.TXT.Google Scholar

Kahan, W. (2004b), On the cost of floating-point computation without extra-precise arithmetic. Available at http://www.cs.berkeley.edu/~wkahan/Qdrtcs.pdf.Google Scholar

Kahan, W. and Thomas, J. W. (1991), Augmenting a programming language with complex arithmetic. Technical report UCB/CSD-92-667, EECS Department, University of California, Berkeley.Google Scholar

Karpinsky, R. (1985), PARANOIA: A floating-point benchmark, BYTE 10, 223.Google Scholar

Knuth, D. E. (1998), The Art of Computer Programming, Vol. 2, third edition, Addison-Wesley.Google Scholar

Kornerup, P., Lefèvre, V., Louvet, N. and Muller, J.-M. (2012), On the computation of correctly rounded sums, IEEE Trans. Comput. 61, 289–298. A proof of Theorems 2 and 3 can be found at https://hal.inria.fr/inria-00475279.CrossRef Google Scholar

Kouya, T. (2019), Performance evaluation of an efficient double-double BLAS1 function with error-free transformation and its application to explicit extrapolation methods, in 26th IEEE Symposium on Computer Arithmetic, pp. 120–123.Google Scholar

Kuki, H. and Cody, W. J. (1973), A statistical study of the accuracy of floating point number systems, Commun . Assoc. Comput. Mach. 16, 223–230.Google Scholar

Kulisch, U. (1971), An axiomatic approach to rounded computations, Numer . Math. 18, 1–17.Google Scholar

Kulisch, U. (2013), Computer Arithmetic and Validity: Theory, Implementation, and Applications, Vol. 33 of Studies in Mathematics, De Gruyter.CrossRef Google Scholar

La Porte, M. and Vignes, J. (1974), Error analysis in computing, in Information Processing 74, North-Holland.Google Scholar

Lange, M. (2022), Toward accurate and fast summation, ACM Trans. Math. Softw. 48, 1–39.CrossRef Google Scholar

Lange, M. and Oishi, S. (2020), A note on Dekker’s FastTwoSum algorithm, Numer . Math. 145, 383–403.Google Scholar

Lange, M. and Rump, S. M. (2017), Error estimates for the summation of real numbers with application to floating-point summation, BIT Numer. Math. 57, 927–941.CrossRef Google Scholar

Lange, M. and Rump, S. M. (2019), Sharp estimates for perturbation errors in summations, Math. Comp. 88, 349–368.CrossRef Google Scholar

Lange, M. and Rump, S. M. (2020), Faithfully rounded floating-point computations, ACM Trans. Math. Softw. 46, 1–20.CrossRef Google Scholar

Langlois, P. and Louvet, N. (2007), How to ensure a faithful polynomial evaluation with the compensated Horner algorithm, in 18th IEEE Symposium on Computer Arithmetic, pp. 141–149.Google Scholar

Lawlor, O., Govind, H., Dooley, I., Breitenfeld, M. and Kale, L. (2005), Performance degradation in the presence of subnormal floating-point values, in International Workshop on Operating System Interference in High Performance Application.Google Scholar

Lefèvre, V. (2013), SIPE: Small Integer Plus Exponent, in 21th IEEE Symposium on Computer Arithmetic, pp. 99–106.Google Scholar

Lefèvre, V. and Muller, J.-M. (2001), Worst cases for correct rounding of the elementary functions in double precision, in 15th IEEE Symposium on Computer Arithmetic, pp. 111–118.Google Scholar

Lefèvre, V., Louvet, N., Muller, J.-M., Picot, J. and Rideau, L. (2022), Accurate calculation of Euclidean norms using double-word arithmetic, ACM Trans. Math. Softw. https://doi.org/10.1145/3568672.CrossRef Google Scholar

Li, X., Demmel, J., Bailey, D. H., Henry, G., Hida, Y., Iskandar, J., Kahan, W., Kapur, A., Martin, M., Tung, T. and Yoo, D. J. (2000), Design, implementation and testing of extended and mixed precision BLAS. Technical report 45991, Lawrence Berkeley National Laboratory. Available at https://netlib.org/lapack/lawnspdf/lawn149.pdf.Google Scholar

Lichtenau, C., Buyuktosunoglu, A., Bertran, R., Figuli, P., Jacobi, C., Papandreou, N., Pozidis, H., Saporito, A., Sica, A. and Tzortzatos, E. (2022), AI accelerator on IBM Telum processor: Industrial product, in 49th ACM International Symposium on Computer Architecture, ACM, pp. 1012–1028.Google Scholar

Lohner, R. J. (2001), On the ubiquity of the wrapping effect in the computation of error bounds, in Perspectives on Enclosure Methods (Kulisch, U. et al., eds), Springer, pp. 201–216.Google Scholar

Lynch, T. and Swartzlander, E. (1992), A formalization for computer arithmetic, in Computer Arithmetic and Enclosure Methods (Atanassova, L. and Hertzberger, J., eds), Elsevier Science, pp. 137–145.Google Scholar

Malcolm, M. A. (1971), On accurate floating-point summation, Commun . Assoc. Comput. Mach. 14, 731–736.Google Scholar

Markstein, P. (1990), Computation of elementary functions on the IBM RISC System/6000 processor, IBM J. Res. Dev. 34, 111–119.CrossRef Google Scholar

Mascarenhas, W. F. (2016), Floating point numbers are real numbers. Available at arXiv:1605.09202.Google Scholar

Matula, D. W. (1968), In-and-out conversions, Commun . Assoc. Comput. Mach. 11, 47–50.Google Scholar

Melquiond, G. (2019), Formal verification for numerical computations, and the other way around. Habilitation à Diriger des Recherches, Université Paris Sud, Orsay.Google Scholar

Mezzarobba, M. (2010), NumGfun: A package for numerical and analytic computation with D-finite functions, in Proceedings of the 2010 International Symposium on Symbolic and Algebraic Computation, ACM, pp. 139–145.

Mezzarobba, M. (2020), Rounding error analysis of linear recurrences using generating series. Available at arXiv:2011.00827.

Micikevicius, P., Stosic, D., Burgess, N., Cornea, M., Dubey, P., Grisenthwaite, R., Ha, S., Heinecke, A., Judd, P., Kamalu, J., Mellempudi, N., Oberman, S., Shoeybi, M., Siu, M. and Wu, H. (2022), FP8 formats for deep learning. Available at https://paperswithcode.com/paper/fp8-formats-for-deep-learning.

Møller, O. (1965), Quasi double-precision in floating-point addition, BIT 5, 37–50.

Monniaux, D. (2008), The pitfalls of verifying floating-point computations, ACM Trans. Program. Lang. Syst. 30, 1–41.

Moore, J. S., Lynch, T. and Kaufmann, M. (1998), A mechanically checked proof of the correctness of the kernel of the AMD5K86 floating point division algorithm, IEEE Trans. Comput. 47, 913–926.

Moore, R. E. (1979), Methods and Applications of Interval Analysis, SIAM Studies in Applied Mathematics, SIAM.

Moore, R. E., Kearfott, R. B. and Cloud, M. J. (2009), Introduction to Interval Analysis, SIAM.

Muller, J.-M. (2015), On the error of computing $ab+cd$ using Cornea, Harrison and Tang’s method, ACM Trans. Math. Softw. 41, 7:1–7:8.

Muller, J.-M. (2016), Elementary Functions, Algorithms and Implementation, third edition, Birkhäuser.

Muller, J.-M. and Rideau, L. (2022), Formalization of double-word arithmetic, and comments on ‘Tight and rigorous error bounds for basic building blocks of double-word arithmetic’, ACM Trans. Math. Softw. 48, 1–24.

Muller, J.-M., Brunie, N., de Dinechin, F., Jeannerod, C.-P., Joldeş, M., Lefèvre, V., Melquiond, G., Revol, N. and Torres, S. (2018), Handbook of Floating-Point Arithmetic, second edition, Birkhäuser.

Neumaier, A. (1974), Rundungsfehleranalyse einiger Verfahren zur Summation endlicher Summen, ZAMM 54, 39–51. In German.

Neumaier, A. (1990), Interval Methods for Systems of Equations, Cambridge University Press.

Nievergelt, Y. (2003), Scalar fused multiply-add instructions produce floating-point matrix arithmetic provably accurate to the penultimate digit, ACM Trans. Math. Softw. 29, 27–48.

Noune, B., Jones, P., Justus, D., Masters, D. and Luschi, C. (2022), 8-bit numerical formats for deep neural networks. Available at arXiv:2206.02915.

Ogita, T., Rump, S. M. and Oishi, S. (2005), Accurate sum and dot product, SIAM J. Sci. Comput. 26, 1955–1988.

Olver, F. W. J. (1983), Error analysis of complex arithmetic, in Computational Aspects of Complex Analysis, Vol. 102 of NATO Science Series C, D. Reidel, pp. 279–292.

Osorio, J., Armejach, A., Petit, E., Henry, G. and Casas, M. (2022), A BF16 FMA is all you need for DNN training, IEEE Trans. Emerg. Topics Comput. 10, 1302–1314.

Overton, M. L. (2001), Numerical Computing with IEEE Floating Point Arithmetic, SIAM.

Ozaki, K., Ogita, T. and Mukunoki, D. (2021), Interval matrix multiplication using fast low-precision arithmetic on GPU, in 9th International Workshop on Reliable Engineering Computing, pp. 419–434.

Ozaki, K., Ogita, T., Oishi, S. and Rump, S. M. (2012), Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications, Numer. Algorithms 59, 95–118.

Parker, D. S., Pierce, B. and Eggert, P. R. (2000), Monte Carlo arithmetic: How to gamble with floating point and win, Comput. Sci. Engng 2, 58–68.

Pichat, M. (1972), Correction d’une somme en arithmétique à virgule flottante, Numer. Math. 19, 400–406. In French.

Pichat, M. (1976), Contribution à l’étude des erreurs d’arrondi en arithmétique à virgule flottante. PhD thesis, Université Scientifique et Médicale de Grenoble & Institut National Polytechnique de Grenoble.

Pion, S. (1999), De la géométrie algorithmique au calcul géométrique. PhD dissertation, Université Nice Sophia Antipolis.

Popescu, V. (2017), Towards fast and certified multiple-precision librairies. PhD dissertation, Université de Lyon, no. 2017LYSEN036.

Posit Working Group (2022), Standard for posit arithmetic. Available at https://posithub.org/docs/posit_standard-2.pdf.

Priest, D. M. (1991), Algorithms for arbitrary precision floating point arithmetic, in 10th IEEE Symposium on Computer Arithmetic, pp. 132–143.

Priest, D. M. (1992), On properties of floating-point arithmetics: Numerical stability and the cost of accurate computations. PhD thesis, University of California at Berkeley.

Priest, D. M. (2004), Efficient scaling for complex division, ACM Trans. Math. Softw. 30, 389–401.

Revol, N. and Rouillier, F. (2005), Motivations for an arbitrary precision interval arithmetic and the MPFI library, Reliable Computing 11, 275–290.

Riedy, E. J. and Demmel, J. (2018), Augmented arithmetic operations proposed for IEEE-754 2018, in 25th IEEE Symposium on Computer Arithmetic, pp. 45–52.

Roux, P. (2014), Innocuous double rounding of basic arithmetic operations, J. Formal. Reasoning 7, 131–142.

Rump, S. M. (2009), Ultimately fast accurate summation, SIAM J. Sci. Comput. 31, 3466–3502.

Rump, S. M. (2010), Verification methods: Rigorous results using floating-point arithmetic, Acta Numer. 19, 287–449.

Rump, S. M. (2012), Error estimation of floating-point summation and dot product, BIT Numer. Math. 52, 201–220.

Rump, S. M. (2015), Computable backward error bounds for basic algorithms in linear algebra, Nonlinear Theory Appl. IEICE 6, 360–363.

Rump, S. M. (2017), IEEE754 precision-k base-β arithmetic inherited by precision-m base-β arithmetic for $k<m$, ACM Trans. Math. Softw. 43, 20:1–20:15.

Rump, S. M. (2019), Error bounds for computer arithmetics, in 26th IEEE Symposium on Computer Arithmetic, pp. 1–14.

Rump, S. M., Ogita, T. and Oishi, S. (2008), Accurate floating-point summation, I: Faithful rounding, SIAM J. Sci. Comput. 31, 189–224.

Rump, S. M., Zimmermann, P., Boldo, S. and Melquiond, G. (2009), Computing predecessor and successor in rounding to nearest, BIT Numer. Math. 49, 419–431.

Severance, C. (1998), IEEE 754: An interview with William Kahan, Computer 31, 114–115.

Shewchuk, J. R. (1997), Adaptive precision floating-point arithmetic and fast robust geometric predicates, Discrete Comput. Geom. 18, 305–363.

Shibata, N. and Petrogalli, F. (2020), SLEEF: A portable vectorized library of C standard mathematical functions, IEEE Trans. Parallel Distrib. Syst. 31, 1316–1327.

Sibidanov, A., Zimmermann, P. and Glondu, S. (2022), The CORE-MATH project, in 29th IEEE Symposium on Computer Arithmetic, pp. 26–34.

Smith, R. L. (1962), Algorithm 116: Complex division, Commun. Assoc. Comput. Mach. 5, 435.

Steele, G. L. Jr and White, J. L. (2004), Retrospective: How to print floating-point numbers accurately, ACM SIGPLAN Notices 39, 372–389.

Sterbenz, P. H. (1974), Floating-Point Computation, Prentice-Hall.

Stewart, G. W. (1985), A note on complex division, ACM Trans. Math. Softw. 11, 238–241.

Strachey, C. (1959), On taking the square root of a complex number, Comput. J. 2, 89.

Sun, X., Wang, N., Chen, C.-Y., Ni, J., Agrawal, A., Cui, X., Venkataramani, S., Maghraoui, K. El, Srinivasan, V. V. and Gopalakrishnan, K. (2020), Ultra-low precision 4-bit training of deep neural networks, in Advances in Neural Information Processing Systems 33 (Larochelle, H. et al., eds), Curran Associates, pp. 1796–1807.

Swartzlander, E. E. and Alexopoulos, A. G. (1975), The sign-logarithm number system, IEEE Trans. Comput. Reprinted in E. E. Swartzlander, Computer Arithmetic, Vol. 1, IEEE, 1990.

Uguen, Y. and de Dinechin, F. (2017), Design-space exploration for the Kulisch accumulator. Available at https://hal.archives-ouvertes.fr/hal-01488916.

Veltkamp, G. W. (1968), ALGOL procedures voor het berekenen van een inwendig product in dubbele precisie. Technical report 22, RC-Informatie, Technische Hogeschool Eindhoven.

Veltkamp, G. W. (1969), ALGOL procedures voor het rekenen in dubbele lengte. Technical report 21, RC-Informatie, Technische Hogeschool Eindhoven.

Wang, S. and Kanwar, P. (2019), Bfloat16: The secret to high performance on cloud TPUs. Available at https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus.

Whaley, R. C., Petitet, A. and Dongarra, J. J. (2001), Automated empirical optimizations of software and the ATLAS project, Parallel Comput. 27, 3–35.

Wilkinson, J. H. (1960), Error analysis of floating-point computation, Numer. Math. 2, 319–340.

Wilkinson, J. H. (1961), Error analysis of direct methods of matrix inversion, J. Assoc. Comput. Mach. 8, 281–330.

Wilkinson, J. H. (1963), Rounding Errors in Algebraic Processes, Notes on Applied Science no. 32, HMSO. Also published by Prentice-Hall. Reprinted by Dover, 1994.

Wilkinson, J. H. (1965), The Algebraic Eigenvalue Problem, Oxford University Press.

Wolfe, J. M. (1964), Reducing truncation errors by programming, Commun. Assoc. Comput. Mach. 7, 355–356.

Yamazaki, I., Tomov, S. and Dongarra, J. (2015), Mixed-precision Cholesky QR factorization and its case studies on multicore CPU with multiple GPUs, SIAM J. Sci. Comput. 37, C307–C330.

Ziv, A. (1999), Sharp ULP rounding error bound for the hypotenuse function, Math. Comp. 68, 1143–1148.