TAOCP 4.2.1 Exercise 15

Section 4.2.1: Single-Precision Calculations

Exercise 15. ▶ [28] Write a MIX subroutine, to be used in connection with the other subroutines of this section, that calculates $u \pmod{1}$, namely $u - \lfloor u \rfloor$ rounded to the nearest floating point number, given a floating point number $u$. Notice that when $u$ is a very small negative number, $u \pmod{1}$ should be rounded so that the result is unity (even though $u \bmod 1$ has been defined to be always less than unity, as a real number).

Verified: yes
Solve time: 1m41s

Setup

We are to design a MIX subroutine that computes

$$ u \pmod 1 = u-\lfloor u\rfloor, $$

returning the result rounded to the nearest normalized floating point number. The routine is intended for use with the floating point package of Section 4.2.1, so the input is a normalized floating point number

$$ u=(e,f), $$

stored in the standard format, and the output should be returned in the same format.

The exercise calls attention to the case of a very small negative number. If

$$ u=-\varepsilon,\qquad 0<\varepsilon\ll1, $$

then

$$ u\pmod1=1-\varepsilon, $$

which rounds to unity when $\varepsilon$ is smaller than half a unit in the last place. Since the floating point system cannot represent the real number $1$, the correct rounded result is the normalized representation of unity,

$$ (b^{-1},e+1), $$

obtained by normalization.

Solution

The value of $u \pmod 1$ depends only on the magnitude of $u$ relative to the integers.

If $|u|<1$, there are two cases.

For $u\ge0$,

$$ u\pmod1=u, $$

so the original floating point number may be returned unchanged.

For $u<0$,

$$ u\pmod1=1+u. $$

The quantity $1$ is represented exactly as the normalized floating point number whose fraction is

$$ b^{-1} $$

and whose exponent is one greater than the exponent used for numbers in $[b^{-1},1)$. Therefore the required result is obtained by adding $u$ to this floating point representation of unity with the floating point addition routine FADD.

Now suppose $|u|\ge1$. Let

$$ u=s\sum_{k=1}^{p} d_k b^{e-k}, \qquad s=\pm1, $$

with $d_1\ne0$. The integer part consists of all digits whose weights are at least $b^0$. Hence the fractional part is determined entirely by the digits whose weights are negative powers of $b$.

Let

$$ m=e-1. $$

If $m\le0$, then $|u|<1$, already handled above. Assume $m>0$.

The integer part occupies the first $m$ significant digits. Therefore:

If $m\ge p$, every stored digit belongs to the integer part, and the represented number is already an exact integer. Hence

$$ u\pmod1=0. $$ 2. If $0<m<p$, the last $p-m$ digits form the fractional part. The integer part is removed by replacing the first $m$ fraction digits by zero and renormalizing the remainder.

For positive $u$ this remainder is exactly

$$ u-\lfloor u\rfloor . $$

For negative $u$, write

$$ u=-n-r, \qquad n\ge0,\quad 0<r<1. $$

Then

$$ u\pmod1 = u-\lfloor u\rfloor = (-n-r)-(-n-1) = 1-r. $$

The quantity $r$ is the positive fractional magnitude extracted above. Hence the desired answer is obtained by forming

$$ 1-r $$

with floating point subtraction.

A MIX implementation can therefore proceed as follows.

The exponent is unpacked and compared with the exponent corresponding to $1$.

If $|u|<1$, return $u$ when the sign is positive; otherwise load the floating point constant $1$ and invoke FSUB or FADD to form $1+u$.

If $|u|\ge1$, compute

$$ m=e-1. $$

When $m\ge p$, return floating point zero.

When $0<m<p$, mask out the first $m$ significant digits of the fraction, leaving only the fractional digits. Normalize the remainder with NORM.

If the original sign was positive, return this normalized remainder.

If the original sign was negative, subtract the normalized remainder from the floating point constant $1$ and return the result.

All arithmetic operations required after extraction of the fractional digits are already supplied by the routines FADD, FSUB, and NORM of Section 4.2.1.

Verification

For a positive number,

$$ u=n+r, \qquad n=\lfloor u\rfloor,\quad 0\le r<1. $$

The algorithm extracts exactly the digits corresponding to $r$, hence returns

$$ r=u-\lfloor u\rfloor. $$

For a negative noninteger,

$$ u=-n-r, \qquad n\ge0,\quad 0<r<1. $$

The extracted fractional magnitude is $r$. The algorithm returns

$$ 1-r, $$

and

$$ 1-r = (-n-r)-(-n-1) = u-\lfloor u\rfloor. $$

If $u$ is an integer, either $m\ge p$ or the extracted fractional digits are all zero. In both situations the algorithm returns $0$, which equals

$$ u-\lfloor u\rfloor. $$

For a very small negative number,

$$ u=-\varepsilon, \qquad 0<\varepsilon<\frac12 b^{-p}, $$

the exact value is

$$ u\pmod1=1-\varepsilon. $$

The nearest floating point number is the representation of unity, because the distance to $1$ is less than half a unit in the last place. The algorithm computes $1+u$ and relies on the standard normalization and rounding routine, therefore it produces precisely this result.

Hence every case yields the floating point number nearest to

$$ u-\lfloor u\rfloor. $$

Notes

The essential observation is that the operation requires no general division. After unpacking the exponent, the fractional part is obtained by discarding the digits whose weights are at least $b^0$. The existing addition, subtraction, and normalization subroutines of Section 4.2.1 then complete the computation. ∎