11 November 2022

What Every Computer Scientist Should Know About Foating-Point Arithmetic

The main objective of this post is to outline the foundational terminology of the floating-point arithmetic, as introduced in the paper What Every Computer Scientist Should Know About Foating-Point Arithmetic (1991) by David Goldberg. The paper is also known in its edited variation, as published by Sun.

This text follows the numbered structure of the original paper and uses the excerpts from the both (they are mostly identical).

For a programmer's starter on the topic, see also The Floating-Point Guide. It may be a sufficient source information in many – especially purely practical – cases.

1. Rounding Error

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation.

… when a program is moved from one machine to another, the results of the basic operations will be the same in every bit if both machines support the IEEE standard 1.

This quote participates in an interesting turn at the end of the edited paper, in the added part by another author.

1.1 Floating-Point Formats

Several different representations of real numbers have been proposed, but by far the most widely used is the floating-point representation. Floating-point representations have a base β (which is always assumed to be even) and a precision ρ.

The term floating-point number will be used to mean a real number that can be exactly represented in the format under discussion.

There are two reasons why a real number might not be exactly representable as a floating-point number. The most common situation is illustrated by the decimal number 0.1. Although it has a finite decimal representation, in binary it has an infinite repeating representation.

Requiring that a floating-point representation be normalized 2 makes the representation unique.

1.2 Relative Error and Ulps

Since rounding error is inherent in floating-point computation, it is important to have a way to measure this error.

The term ulps will be used as shorthand for “units in the last place.”

Another way to measure the difference between a floating-point number and the real number it is approximating is relative error, which is simply the difference between the two numbers divided by the real number.

In particular, the relative error corresponding to 1/2 ulp can vary by a factor of β. This factor is called the wobble. … when a real number is rounded to the closest floating-point number, the relative error is always bounded by ε, which is referred to as machine epsilon.

1.3 Guard Digits

floating-point hardware normally operates on a fixed number of digits.

… the absolute error can be as large as the result …

… one extra digit is added to guard against this situation (a guard digit).

1.4 Cancellation

Section 1.3 can be summarized by saying that without a guard digit, the relative error committed when subtracting two nearby quantities can be very large.

When subtracting nearby quantities, the most significant digits in the operands match and cancel each other. There are two kinds of cancellation: catastrophic and benign.

Catastrophic cancellation occurs when the operands are subject to rounding errors. … cancellation can cause many of the accurate digits to disappear, leaving behind mainly digits contaminated by rounding error.

Benign cancellation occurs when subtracting exactly known quantities.

Sometimes a formula that gives inaccurate results can be rewritten to have much higher numerical accuracy by using benign cancellation; however, the procedure only works if subtraction is performed using a guard digit.

1.5 Exactly Rounded Operations

Also commonly referred to as correctly rounded.

– Editor (Sun-published)

… when using the round up rule, computations can gradually drift upward, whereas when using round to even … this cannot happen.

But accurate operations are useful even in the face of inexact data, because they enable us to establish exact relationships …

2. IEEE Standard

There are two different IEEE standards for floating-point computation. IEEE 754 is a binary standard that requires β = 2, ρ = 24 for single precision and ρ = 53 for double precision.

IEEE 854 allows either β = 2 or β = 10 and unlike 754, does not specify how floating-point numbers are encoded into bits.

The term IEEE Standard will be used when discussing properties common to both standards.

2.1 Formats and Operations

2.1.1 Base

In most modern hardware, the performance gained by avoiding a shift for a subset of operands is negligible, and so the small wobble of β = 2 makes it the preferable base.

2.1.2 Precision

The IEEE standard defines four different precisions: single, double, single extended, and double extended.

The standard puts the most emphasis on extended precision, making no recommendation concerning double precision, but strongly recommending that

Implementations should support the extended format corresponding to the widest basic format supported, …

2.1.3 Exponent

Two common methods of representing signed numbers are sign/magnitude and two’s complement.

The IEEE binary standard does not use either of these methods to represent the exponent, but instead uses a biased representation 3.

2.1.4 Operations

The IEEE standard requires that the result of addition, subtraction, multiplication and division be exactly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number (using round to even).

One reason for completely specifying the results of arithmetic operations is to improve the portability of software.

Another advantage of precise specification is that it makes it easier to reason about floating-point.

There is not complete agreement on what operations a floating-point standard should cover.

2.2 Special Quantities

The IEEE standard … has NaNs (Not a Number) and infinities. Without any special quantities, there is no good way to handle exceptional situations like taking the square root of a negative number, other than aborting computation.

2.2.1 NaNs

This problem can be avoided by introducing a special value called NaN, and specifying that the computation of expressions like 0/0 and √–1 produce NaN, rather than halting.

In general, whenever a NaN participates in a floating-point operation, the result is another NaN.

2.2.2 Infinity

infinities provide a way to continue when an overflow occurs.

2.2.3 Signed Zero

Since the sign bit can take on two different values, there are two zeros, +0 and -0.

2.2.4 Denormalized Numbers

The IEEE standard uses denormalized numbers, which guarantee

x = y ⇔ x - y = 0 ,

as well as other useful relations.

2.3 Exceptions, Flags, and Trap Handlers

When an exceptional condition like division by zero or overflow occurs in IEEE arithmetic, the default is to deliver a result and continue. … When any exception occurs, a status flag is also set.

The IEEE standard strongly recommends that implementations allow trap handlers to be installed. Then when an exception occurs, the trap handler is called instead of setting the flag. The value returned by the trap handler will be used as the result of the operation. It is the responsibility of the trap handler to either clear or set the status flag; otherwise, the value of the flag is allowed to be undefined.

The IEEE standard divides exceptions into 5 classes: overflow, underflow, division by zero, invalid operation and inexact. There is a separate status flag for each class of exception.

2.3.1 Trap Handlers

One obvious use for trap handlers is for backward compatibility.

… a more interesting use … comes up when computing products … that could potentially overflow. … solution using trap handlers called over / underflow counting.

2.3.2 Rounding Modes

By default, rounding means round toward nearest. The standard requires that three other rounding modes be provided, namely round toward 0, round toward +∞, and round toward -∞.

3. Systems Aspects

The design of almost every aspect of a computer system requires knowledge about floating-point. Computer architectures usually have floating-point instructions, compilers must generate those floating-point instructions, and the operating system must decide what to do when exception conditions are raised for those floating-point instructions.

3.1 Instruction Sets

… modern instruction sets tend to provide only instructions that produce a result of the same precision as the operands.

… instructions that multiply two floating-point numbers and return a product with twice the precision of the operands make a useful addition to a floating-point instruction set.

3.2 Languages and Compilers

3.2.1 Ambiguity

Ideally, a language definition should define the semantics of the language precisely enough to prove statements about programs. Whereas this is usually true for the integer part of a language, language definitions often have a large grey area when it comes to floating-point. 4

3.2.2 IEEE Standard

… there is usually a mismatch between floating-point hardware that supports the standard and programming languages …

3.2.3 Optimizers

… there are useful optimizations that can be done on floating-point code.

3.3 Exception Handling

Trap handlers … raise some interesting systems issues.

Hardware support for identifying exactly which operation trapped may be necessary.

Kahan has proposed using presubstitution 5 instead of trap handlers to avoid these problems. In this method, the user specifies an exception and a value to be used as the result when the exception occurs.

The advantage of presubstitution is that it has a straightforward hardware implementation. … the widespread acceptance of the IEEE standard makes it unlikely to be widely implemented by hardware manufacturers.

4. Details

We now proceed to show that floating-point is not black magic, but rather is a straightforward subject whose claims can be verified mathematically.

5. Summary

This paper has demonstrated that it is possible to reason rigorously about floating point.

The increasing acceptance of the IEEE floating-point standard means that codes that use features of the standard are becoming even more portable.

Differences Among IEEE 754 Implementations (Sun edition only)

A taste of inevitability: We're on the loop here.

Unfortunately, the IEEE standard does not guarantee that the same program will deliver identical result on all conforming systems. Most programs will actually produce different results on different systems for a variety of reasons.

Should've expected this, shouldn't we?



Wikipedia: Normalized number.

"In applied mathematics, a number is normalized when it is written in scientific notation with one non-zero decimal digit before the decimal point."


Wikipedia: Offset binary.

"… the minimal negative value is represented by all-zeros, the 'zero' value is represented by a 1 in the most significant bit and zero in all other bits, and the maximal positive value is represented by all-ones …"


Kahan, W. M. (5 July 2005). "A Demonstration of Presubstitution for ∞/∞".

Exceptions become Errors only when mishandled.

In June 1996 the Ariane V rocket turned cartwheels and blew up half a billion dollars worth of instruments intended for European science in space. The proximate cause was the programming language ADA’s policy of aborting computation when an Arithmetic Error, in this case an irrelevant Floating-Point → Integer Overflow, occurred.

In Sept. 1997 the Aegis missile-cruiser Yorktown spent almost three hours adrift off Cape Charles VA, its software-controlled propulsion and steering disabled, waiting for Microsoft Windows NT 4.0 to be rebooted after a division-by-zero unexpectedly trapped into it from a data-base program that had interpreted an accidentally blank field as zero.

Creative Commons License This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Created by Y.E.T.If you see an error, please report it.