TAOCP 4.2.1 Exercise 17

Section 4.2.1: Single-Precision Calculations

Exercise 17. [40] (John Cocke.) Explore the idea of extending the range of floating point numbers by defining a single-word representation in which the precision of the fraction decreases as the magnitude of the exponent increases.

Verified: yes
Solve time: 1m41s

Setup

A floating point number is represented by a single word containing a fraction and exponent. Exercise 17 asks for an exploration of a modified representation in which the exponent and fraction share the same word, and in which larger exponents leave fewer bits available for the fraction. The objective is to increase the representable range while retaining a single-word format.

Let the machine word contain $W$ binary digits. We seek a representation of real numbers

$$ x=\pm m,2^e, $$

in which the number of digits used for $m$ depends on $|e|$. The problem is to describe such a system, determine its range and precision characteristics, and compare it with ordinary floating point notation.

Solution

A natural method is to encode the exponent in a variable-length form. Let a nonzero number be represented by

$$ \pm,1,\underbrace{11\cdots1}_{k\text{ bits}}0,d_1d_2\cdots d_t, $$

where the initial run of $k$ ones followed by a zero specifies the magnitude of the exponent, and the remaining $t$ bits specify the fraction.

Since the total word length is fixed,

$$ 1+(k+1)+t=W, $$

hence

$$ t=W-k-2. $$

The exponent magnitude is determined by $k$. For example, define

$$ |e|=2^k-1, $$

and let one additional sign bit determine the sign of $e$. Then larger exponents require larger values of $k$, leaving fewer bits for the fraction.

For a given $k$, the precision is

$$ t=W-k-2. $$

The relative spacing between adjacent normalized numbers of that exponent class is therefore

$$ 2^{-t} =2^{-(W-k-2)}. $$

Thus precision deteriorates exponentially with increasing $k$.

The largest exponent representable occurs when only the minimum number of fraction bits remain. If at least one fraction bit is required,

$$ t\ge1, $$

$$ k\le W-3. $$

Hence

$$ |e|_{\max} =2^{W-3}-1. $$

The largest representable magnitude is approximately

$$ 2^{,2^{W-3}-1}, $$

which is enormously larger than the range obtainable from a fixed-size exponent field. In an ordinary floating point system with $a$ exponent bits and $W-a-1$ fraction bits, the maximum exponent is only of order $2^a$.

The price paid for this increased range is reduced precision. When

$$ |e|=2^k-1, $$

only

$$ W-k-2 $$

fraction bits remain. Consequently numbers of very large magnitude are represented with very few significant digits.

A different but equivalent formulation uses a tapered floating point system. Let the word contain a signed integer

$$ n. $$

Interpret $n$ by writing its binary expansion in normalized form

$$ n=\pm,2^e(1.f). $$

The exponent length automatically increases with $|n|$, while the available fraction length decreases. Consecutive integers then correspond to consecutive floating point numbers. Near zero the spacing is extremely fine, because many bits remain available for the significand. Far from zero the spacing becomes coarser.

This scheme has several desirable properties.

First, the representable range grows doubly exponentially with the number of exponent-length bits. A single word can cover an immense dynamic range.

Second, numbers near unity possess nearly full precision. Most scientific computations involve quantities whose exponents are moderate; such numbers retain almost all of the available word length for the fraction.

Third, underflow and overflow become much less frequent. Instead of reserving a fixed portion of the word for the exponent, the representation allocates exponent information only when it is needed.

The disadvantage is that arithmetic operations become more complicated. Addition requires decoding variable-length exponents before alignment can occur. Multiplication and division require renormalization in a format whose precision is not fixed. Error analysis also becomes less uniform, because the relative error bound depends on the exponent class.

Therefore the system exchanges simplicity and uniform precision for greatly increased range. Numbers of ordinary size are represented nearly as accurately as in conventional floating point arithmetic, while extremely large or small numbers remain representable at reduced precision.

Verification

For exponent class $k$,

$$ t=W-k-2, $$

because one sign bit, $k$ unary exponent bits, and one terminating zero consume $k+2$ positions of the word.

The relative spacing between adjacent values in that class equals one unit in the last fraction place:

$$ 2^{-t} =2^{-(W-k-2)}. $$

As $k$ increases, the spacing increases and precision decreases.

The maximum exponent occurs when $t=1$:

$$ k=W-3, $$

giving

$$ |e|_{\max}=2^{W-3}-1. $$

Hence the range increases dramatically compared with a fixed-field exponent representation, while precision decreases correspondingly.

Notes

The idea developed here is essentially a tapered floating point representation. Fixed-length floating point allocates a constant number of bits to the exponent and significand. A tapered system allows these allocations to vary from number to number. John Cocke's suggestion anticipates later variable-precision floating point schemes and the modern concept of unum-like representations, where precision is concentrated near the numbers most frequently encountered and sacrificed only when exceptionally large range is required.