# What Every Computer Scientist Should Know About Floating-Point Arithmetic

The main objective of this post is to outline the *foundational terminology* of
*floating-point arithmetic*, as introduced in the paper What Every Computer
Scientist Should Know About Floating-Point Arithmetic (1991) by **David
Goldberg**. The paper is also known in an edited version published by Sun.

This text follows the numbered structure of the original paper and uses excerpts from both versions (they are mostly identical).

For a programmer's primer on the topic, see also The Floating-Point Guide. It may be a sufficient source of information in many cases, especially purely practical ones.

## 1. Rounding Error

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation.

… when a program is moved from one machine to another, the results of the basic operations will be the same in every bit if both machines support the IEEE standard^{1}.

This quote takes an interesting turn at the end of the edited paper, in the part added by another author.

### 1.1 Floating-Point Formats

Several different representations of real numbers have been proposed, but by far the most widely used is the floating-point representation. Floating-point representations have a base β (which is always assumed to be even) and a precision ρ.

The term floating-point number will be used to mean a real number that can be exactly represented in the format under discussion.

There are two reasons why a real number might not be exactly representable as a floating-point number. The most common situation is illustrated by the decimal number `0.1`. Although it has a finite decimal representation, in binary it has an infinite repeating representation.
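As a small illustration of the `0.1` case, the following Python sketch inspects the value a binary double actually stores. It assumes CPython with IEEE 754 doubles; the variable names are only illustrative.

```python
from decimal import Decimal
from fractions import Fraction

# Decimal(float) and Fraction(float) convert the stored binary value exactly,
# so they reveal what the literal 0.1 actually became in memory.
stored = Decimal(0.1)
as_fraction = Fraction(0.1)

print(stored)        # a long decimal expansion, not exactly 0.1
print(as_fraction)   # a ratio whose denominator is a power of two

is_exact = as_fraction == Fraction(1, 10)
print(is_exact)      # False
```

The denominator being a power of two is the point: only such fractions are exactly representable when β = 2.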

Requiring that a floating-point representation be normalized^{2} makes the representation unique.

### 1.2 Relative Error and Ulps

Since rounding error is inherent in floating-point computation, it is important to have a way to measure this error.

The term ulps will be used as shorthand for “units in the last place.”

Another way to measure the difference between a floating-point number and the real number it is approximating is relative error, which is simply the difference between the two numbers divided by the real number.
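To make the two measures concrete, here is a minimal sketch that computes the error of the double nearest to 1/10, both in ulps and as a relative error. It assumes Python 3.9+ (for `math.ulp`) and IEEE 754 doubles; exact rational arithmetic via `Fraction` keeps the measurement itself free of rounding.

```python
import math
from fractions import Fraction

x = 0.1                       # the double closest to the real number 1/10
exact = Fraction(1, 10)       # the real number being approximated
stored = Fraction(x)          # exact rational value of the stored double

abs_err = abs(stored - exact)
rel_err = abs_err / exact                       # relative error
err_in_ulps = abs_err / Fraction(math.ulp(x))   # error in units in the last place

print(err_in_ulps)   # 2/5, i.e. below half an ulp
print(rel_err)       # 1/18014398509481984, i.e. 2**-54
```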

In particular, the relative error corresponding to `1/2` ulp can vary by a factor of β. This factor is called the wobble. … when a real number is rounded to the closest floating-point number, the relative error is always bounded by ε, which is referred to as machine epsilon.
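The wobble can be observed directly. The sketch below compares the relative size of half an ulp at the bottom and at the top of the binade [1, 2). It assumes Python 3.9+ and IEEE 754 doubles. Note one assumption about terminology: Python's `sys.float_info.epsilon` is the gap between 1.0 and the next double (2⁻⁵²), which is twice the rounding bound ε = (β/2)β^(−ρ) = 2⁻⁵³ used in the paper.

```python
import math
import sys

# Relative size of half an ulp at the low end of the binade [1, 2).
low = 1.0
rel_low = (math.ulp(low) / 2) / low

# The same quantity for the largest double below 2.0.
high = math.nextafter(2.0, 0.0)
rel_high = (math.ulp(high) / 2) / high

wobble = rel_low / rel_high          # approaches beta = 2
paper_epsilon = sys.float_info.epsilon / 2

print(rel_low == paper_epsilon)      # True: the bound is attained at 1.0
print(wobble)                        # just under 2.0
```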

### 1.3 Guard Digits

Floating-point hardware normally operates on a fixed number of digits.

… the absolute error can be as large as the result …

… one extra digit is added to guard against this situation (a guard digit).
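The effect of a guard digit can be sketched with a toy decimal machine (β = 10, ρ = 3). The function `toy_subtract` and its parameters are hypothetical, written only for this illustration, and the operands are chosen for the example. The model discards the digits of the smaller operand that are shifted past the last kept place, which is one simple way hardware without a guard digit might behave.

```python
from decimal import Decimal, ROUND_DOWN

def toy_subtract(x, y, p, guard_digits):
    """Subtract y from x (assumes x > y > 0) on a toy base-10 machine with p digits."""
    # Exponent of the last digit kept while aligning y to x's exponent.
    last_place = x.adjusted() - (p - 1) - guard_digits
    quantum = Decimal(1).scaleb(last_place)
    # Digits of y shifted past the last kept place are simply discarded.
    y_aligned = y.quantize(quantum, rounding=ROUND_DOWN)
    # The results below already fit in p digits, so no final rounding is modeled.
    return x - y_aligned

x, y = Decimal("10.1"), Decimal("9.93")
without_guard = toy_subtract(x, y, p=3, guard_digits=0)
with_guard = toy_subtract(x, y, p=3, guard_digits=1)

print(without_guard)  # 0.2
print(with_guard)     # 0.17
```

The exact difference is 0.17, so the result without a guard digit is wrong in every digit, while a single guard digit recovers the exact answer in this case.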

### 1.4 Cancellation

Section 1.3 can be summarized by saying that without a guard digit, the relative error committed when subtracting two nearby quantities can be very large.

When subtracting nearby quantities, the most significant digits in the operands match and cancel each other. There are two kinds of cancellation: catastrophic and benign.

Catastrophic cancellation occurs when the operands are subject to rounding errors. … cancellation can cause many of the accurate digits to disappear, leaving behind mainly digits contaminated by rounding error.

Benign cancellation occurs when subtracting exactly known quantities.

Sometimes a formula that gives inaccurate results can be rewritten to have much higher numerical accuracy by using benign cancellation; however, the procedure only works if subtraction is performed using a guard digit.
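A common illustration of such a rewrite is computing x² − y² as (x − y)(x + y). The sketch below assumes IEEE 754 doubles; the operand values are chosen for the example so that the squares are rounded before the subtraction happens.

```python
x = 100000001.0   # 1e8 + 1, exactly representable as a double
y = 100000000.0   # 1e8, exactly representable as a double

# Catastrophic: x*x is rounded first (it needs more than 53 bits),
# and the subtraction then exposes that rounding error.
naive = x * x - y * y

# Benign: x - y and x + y are computed from exactly known operands.
factored = (x - y) * (x + y)

print(naive)      # 200000000.0
print(factored)   # 200000001.0, the exact answer
```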

### 1.5 Exactly Rounded Operations

Also commonly referred to as **correctly rounded**.

– Editor (Sun-published)

… when using the round up rule, computations can gradually drift upward, whereas when using round to even … this cannot happen.
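The drift can be reproduced with Python's `decimal` module on a three-digit decimal machine by repeatedly computing x = (x − y) + y. This is a sketch under the assumption that the "round up rule" corresponds to `ROUND_HALF_UP` and "round to even" to `ROUND_HALF_EVEN`; the helper `iterate` and the starting values are illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def iterate(rounding, steps=10):
    """Apply x = (x - y) + y repeatedly with 3 significant digits."""
    ctx = Context(prec=3, rounding=rounding)
    x, y = Decimal("1.00"), Decimal("-0.555")
    for _ in range(steps):
        x = ctx.add(ctx.subtract(x, y), y)
    return x

drifted = iterate(ROUND_HALF_UP)
stable = iterate(ROUND_HALF_EVEN)

print(drifted)  # 1.10: grows by 0.01 per step
print(stable)   # 1.00: unchanged
```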

But accurate operations are useful even in the face of inexact data, because they enable us to establish exact relationships …

## 2. IEEE Standard

There are two different IEEE standards for floating-point computation. IEEE 754 is a binary standard that requires β = 2, ρ = 24 for single precision and ρ = 53 for double precision.
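These parameters can be checked from Python, assuming a platform where `float` is an IEEE 754 double (the usual case). The single-precision check round-trips a value through a 4-byte encoding with `struct`; the helper name `to_single` is illustrative.

```python
import struct
import sys

radix = sys.float_info.radix          # beta
precision = sys.float_info.mant_dig   # precision of a double

def to_single(x):
    """Round a double to single precision and convert it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# 2**24 + 1 needs 25 significant bits, so it does not fit in single precision.
rounded = to_single(16777217.0)

print(radix, precision)   # 2 53
print(rounded)            # 16777216.0
```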

IEEE 854 allows either β = 2 or β = 10 and, unlike 754, does not specify how floating-point numbers are encoded into bits.

The term IEEE Standard will be used when discussing properties common to both standards.

### 2.1 Formats and Operations

#### 2.1.1 Base

In most modern hardware, the performance gained by avoiding a shift for a subset of operands is negligible, and so the small wobble of β = 2 makes it the preferable base.

#### 2.1.2 Precision

The IEEE standard defines four different precisions: single, double, single extended, and double extended.

The standard puts the most emphasis on extended precision, making no recommendation concerning double precision, but strongly recommending that

Implementations should support the extended format corresponding to the widest basic format supported, …

#### 2.1.3 Exponent

Two common methods of representing signed numbers are sign/magnitude and two’s complement.

The IEEE binary standard does not use either of these methods to represent the exponent, but instead uses a biased representation^{3}.
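The biased exponent can be read straight out of the bit pattern of a double. The sketch below assumes the IEEE 754 double layout (1 sign bit, 11 exponent bits, 52 fraction bits) with a bias of 1023; the helper `biased_exponent` is a name invented for this example.

```python
import struct

BIAS = 1023  # exponent bias of the IEEE 754 double format

def biased_exponent(x):
    """Return the raw 11-bit exponent field of a double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return (bits >> 52) & 0x7FF

for value in (0.5, 1.0, 2.0):
    raw = biased_exponent(value)
    print(value, raw, raw - BIAS)   # stored field and the true exponent
```

For `1.0` the stored field is 1023, so the true exponent 0 sits in the middle of the unsigned range rather than being encoded with a sign.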

#### 2.1.4 Operations

The IEEE standard requires that the result of addition, subtraction, multiplication and division be exactly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number (using round to even).
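One way to see what "computed exactly and then rounded" means is to redo each operation in exact rational arithmetic and round once at the end. The sketch assumes CPython on IEEE 754 hardware, where converting a `Fraction` to `float` rounds to the nearest double; the operand values are arbitrary.

```python
import operator
from fractions import Fraction

x, y = 0.1, 0.2
operations = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": operator.truediv,
}

checks = {}
for name, op in operations.items():
    exact = op(Fraction(x), Fraction(y))      # exact result, no rounding
    checks[name] = float(exact) == op(x, y)   # one rounding vs. hardware result

print(checks)   # every entry is True
```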

One reason for completely specifying the results of arithmetic operations is to improve the portability of software.

Another advantage of precise specification is that it makes it easier to reason about floating-point.

There is not complete agreement on what operations a floating-point standard should cover.

### 2.2 Special Quantities

The IEEE standard … has NaNs (Not a Number) and infinities. Without any special quantities, there is no good way to handle exceptional situations like taking the square root of a negative number, other than aborting computation.

#### 2.2.1 NaNs

This problem can be avoided by introducing a special value called NaN, and specifying that the computation of expressions like `0/0` and `√–1` produce NaN, rather than halting.

In general, whenever a NaN participates in a floating-point operation, the result is another NaN.
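NaN propagation is easy to observe in Python. One caveat, stated as a property of the language rather than of the standard: Python itself raises `ZeroDivisionError` for `0.0 / 0.0` and `ValueError` for `math.sqrt(-1)`, so this sketch produces a NaN from `inf - inf`, another invalid operation.

```python
import math

nan = math.inf - math.inf     # an invalid operation that yields NaN

propagated = (nan + 1.0) * 2.0
print(math.isnan(nan))         # True
print(math.isnan(propagated))  # True: NaN flows through later operations
print(nan == nan)              # False: NaN compares unequal even to itself
```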

#### 2.2.2 Infinity

… infinities provide a way to continue when an overflow occurs.
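A short sketch of continuing after overflow, assuming IEEE 754 doubles. It uses multiplication because Python float multiplication returns infinity on overflow, whereas some other Python operations (such as `**`) raise `OverflowError` instead.

```python
import math

overflowed = 1e308 * 10.0        # too large for a double, becomes +infinity
reciprocal = 1.0 / overflowed    # computation continues with a sensible value

print(overflowed)   # inf
print(reciprocal)   # 0.0
```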

#### 2.2.3 Signed Zero

Since the sign bit can take on two different values, there are two zeros, `+0` and `-0`.
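The two zeros compare equal but remain distinguishable through their sign bit, as this small sketch shows (assuming IEEE 754 doubles).

```python
import math

pos_zero = 0.0
neg_zero = -0.0

print(pos_zero == neg_zero)            # True: the zeros compare equal
print(math.copysign(1.0, neg_zero))    # -1.0: the sign bit is still there

# A negative zero arises naturally, for example from 1 / -infinity.
from_division = 1.0 / -math.inf
print(math.copysign(1.0, from_division))   # -1.0
```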

#### 2.2.4 Denormalized Numbers

The IEEE standard uses denormalized numbers, which guarantee `x = y ⇔ x - y = 0`, as well as other useful relations.
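The guarantee matters near the bottom of the normalized range. In the sketch below, two distinct doubles differ by less than the smallest normalized number, and their difference is still nonzero because it is stored as a denormalized number. It assumes IEEE 754 doubles with gradual underflow enabled, which is the default on common platforms.

```python
import sys

smallest_normal = sys.float_info.min   # 2**-1022
x = smallest_normal * 1.25
y = smallest_normal

diff = x - y   # 2**-1024, below the normalized range

print(x != y)                        # True
print(diff != 0.0)                   # True: the difference did not flush to zero
print(0.0 < diff < smallest_normal)  # True: it is a denormalized number
```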

### 2.3 Exceptions, Flags, and Trap Handlers

When an exceptional condition like division by zero or overflow occurs in IEEE arithmetic, the default is to deliver a result and continue. … When any exception occurs, a status flag is also set.

The IEEE standard strongly recommends that implementations allow trap handlers to be installed. Then when an exception occurs, the trap handler is called instead of setting the flag. The value returned by the trap handler will be used as the result of the operation. It is the responsibility of the trap handler to either clear or set the status flag; otherwise, the value of the flag is allowed to be undefined.

The IEEE standard divides exceptions into 5 classes: overflow, underflow, division by zero, invalid operation and inexact. There is a separate status flag for each class of exception.
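Python's binary `float` does not expose status flags, but the `decimal` module models a similar flag/trap split, so it can serve as a rough sketch of the idea. This is an analogy, not the hardware mechanism: a trapped condition in `decimal` raises a Python exception rather than calling a handler whose return value becomes the result.

```python
from decimal import Context, Decimal, DivisionByZero

# No traps enabled: deliver a result, set the flag, and continue.
quiet = Context(traps=[])
result = quiet.divide(Decimal(1), Decimal(0))
flag_set = quiet.flags[DivisionByZero]

# Trap enabled: the same condition interrupts the computation instead.
trapping = Context(traps=[DivisionByZero])
try:
    trapping.divide(Decimal(1), Decimal(0))
    trapped = False
except DivisionByZero:
    trapped = True

print(result, flag_set, trapped)   # Infinity True True
```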

#### 2.3.1 Trap Handlers

One obvious use for trap handlers is for backward compatibility.

… a more interesting use … comes up when computing products … that could potentially overflow. … solution using trap handlers called over / underflow counting.
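Python offers no floating-point trap handlers, so the following is only a software analogue of the counting idea, not the trap-based scheme itself: the running product is kept as a mantissa in [0.5, 1) with the exponent tracked separately, so the intermediate value cannot overflow or underflow. The function `scaled_product` is hypothetical and assumes finite, nonzero inputs.

```python
import math

def scaled_product(values):
    """Return (mantissa, exponent) with product == mantissa * 2**exponent."""
    mantissa, exponent = 1.0, 0
    for v in values:
        m, e = math.frexp(v)          # v == m * 2**e with 0.5 <= |m| < 1
        mantissa *= m
        exponent += e
        # Renormalize so the mantissa never drifts toward underflow.
        mantissa, shift = math.frexp(mantissa)
        exponent += shift
    return mantissa, exponent

values = [1e200, 1e200, 1e-200, 1e-200]

naive = 1.0
for v in values:
    naive *= v                        # overflows to inf after the second factor

m, e = scaled_product(values)
recovered = math.ldexp(m, e)

print(naive)       # inf
print(recovered)   # close to 1.0
```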

#### 2.3.2 Rounding Modes

By default, rounding means round toward nearest. The standard requires that three other rounding modes be provided, namely round toward `0`, round toward `+∞`, and round toward `-∞`.
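Python does not expose the binary rounding mode of the hardware, but the four directions can be sketched with the `decimal` module, whose rounding constants are assumed here to correspond to the modes named above. The helper `round_all` is illustrative.

```python
from decimal import (Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR,
                     ROUND_HALF_EVEN)

modes = {
    "nearest (ties to even)": ROUND_HALF_EVEN,
    "toward 0": ROUND_DOWN,
    "toward +inf": ROUND_CEILING,
    "toward -inf": ROUND_FLOOR,
}

def round_all(value):
    """Round a decimal string to an integer under each rounding mode."""
    one = Decimal("1")
    return {name: int(Decimal(value).quantize(one, rounding=mode))
            for name, mode in modes.items()}

positive = round_all("2.7")
negative = round_all("-2.7")

print(positive)   # 3, 2, 3, 2 in the order of `modes`
print(negative)   # -3, -2, -2, -3 in the order of `modes`
```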

## 3. Systems Aspects

The design of almost every aspect of a computer system requires knowledge about floating-point. Computer architectures usually have floating-point instructions, compilers must generate those floating-point instructions, and the operating system must decide what to do when exception conditions are raised for those floating-point instructions.

### 3.1 Instruction Sets

… modern instruction sets tend to provide only instructions that produce a result of the same precision as the operands.

… instructions that multiply two floating-point numbers and return a product with twice the precision of the operands make a useful addition to a floating-point instruction set.

### 3.2 Languages and Compilers

#### 3.2.1 Ambiguity

Ideally, a language definition should define the semantics of the language precisely enough to prove statements about programs. Whereas this is usually true for the integer part of a language, language definitions often have a large grey area when it comes to floating-point.^{4}

#### 3.2.2 IEEE Standard

… there is usually a mismatch between floating-point hardware that supports the standard and programming languages …

#### 3.2.3 Optimizers

… there are useful optimizations that can be done on floating-point code.

### 3.3 Exception Handling

Trap handlers … raise some interesting systems issues.

Hardware support for identifying exactly which operation trapped may be necessary.

Kahan has proposed using presubstitution^{5} instead of trap handlers to avoid these problems. In this method, the user specifies an exception and a value to be used as the result when the exception occurs.

The advantage of presubstitution is that it has a straightforward hardware implementation. … the widespread acceptance of the IEEE standard makes it unlikely to be widely implemented by hardware manufacturers.

## 4. Details

We now proceed to show that floating-point is not `black magic`, but rather is a straightforward subject whose claims can be verified mathematically.

## 5. Summary

This paper has demonstrated that it is possible to reason rigorously about floating point.

The increasing acceptance of the IEEE floating-point standard means that codes that use features of the standard are becoming even more portable.

## Differences Among IEEE 754 Implementations (Sun edition only)

A taste of inevitability: We're on the loop here.

Unfortunately, the **IEEE standard** does not guarantee that the same program will
deliver identical results on all conforming systems. Most programs will actually
produce different results on different systems for a variety of reasons.

Should've expected this, shouldn't we?

## Footnotes:

^{2} "In applied mathematics, a number is normalized when it is written in scientific notation with one non-zero decimal digit before the decimal point."

^{3} "… the minimal negative value is represented by all-zeros, the 'zero' value is represented by a 1 in the most significant bit and zero in all other bits, and the maximal positive value is represented by all-ones …"

^{5} Kahan, W. M. (5 July 2005). "A Demonstration of Presubstitution for ∞/∞".

**Exceptions** become **Errors** only when mishandled.

In June 1996 the Ariane V rocket turned cartwheels and blew up half a billion dollars worth of instruments intended for European science in space. The proximate cause was the programming language ADA’s policy of aborting computation when an Arithmetic Error, in this case an irrelevant Floating-Point → Integer Overflow, occurred.

In Sept. 1997 the Aegis missile-cruiser Yorktown spent almost three hours adrift off Cape Charles VA, its software-controlled propulsion and steering disabled, waiting for Microsoft Windows NT 4.0 to be rebooted after a division-by-zero unexpectedly trapped into it from a data-base program that had interpreted an accidentally blank field as zero.