TAOCP 4.2.4 Exercise 2

Section 4.2.4: Distribution of Floating Point Numbers

Exercise 2. [22] Make further tests of floating point addition and subtraction, to confirm or improve on the accuracy of Tables 1 and 2.

Verified: no
Solve time: 52s

Solution

Let $u$ and $v$ be normalized floating point numbers, and let the double-precision operations $\oplus$, $\ominus$, and $\otimes$ denote addition, subtraction, and multiplication carried out in the extended-precision subroutines described in Section 4.2.4. The exercise asks for bounds on the relative errors

$\delta_1 = \frac{(u \oplus v) - (u + v)}{u + v}, \qquad \delta_2 = \frac{(u \otimes v) - (u \times v)}{u \times v}, \qquad \delta_3 = \frac{(u \ominus v) - (u/v)}{u/v}.$

We analyze each case in turn.

Addition

By Theorem 4.2.2C, when $|u| \ge |v|$, the sum $u + v$ can be decomposed exactly as

$u + v = w + r, \qquad w = u \oplus v, \quad r = (u \ominus w) \oplus v,$

where $|r| \le |w|/2^p$. In the double-precision subroutine, the final result is $u \oplus v = w + r$ rounded to the nearest floating point number in the double-precision format. Let $\text{fl}(\cdot)$ denote the floating point rounding. Then

$(u \oplus v) - (u + v) = \text{fl}(w + r) - (w + r).$

The standard bound on rounding to nearest in $p$-bit fraction arithmetic is

$|\text{fl}(x) - x| \le \frac{|x|}{2^{p+1}}.$

Applying this to $x = w + r$, and noting $|w + r| \ge |w| - |r| \ge |w|(1 - 2^{-p})$, we have

$|(u \oplus v) - (u + v)| \le \frac{|w + r|}{2^{p+1}} \le \frac{|w|(1 + 2^{-p})}{2^{p+1}}.$

Dividing by $|u + v| = |w + r|$, the relative error satisfies

$|\delta_1| \le \frac{1}{2^{p+1}}.$

Hence the bound on the relative error for double-precision addition is

$\boxed{|\delta_1| \le 2^{-(p+1)}}.$

Multiplication

Let $u \otimes v$ denote the double-precision product computed by the subroutine. Represent $u$ and $v$ as $u = u_h + u_l$, $v = v_h + v_l$, where $u_h$, $v_h$ are the high parts in single precision and $u_l$, $v_l$ the low parts. The exact product satisfies

$u v = u_h v_h + (u_h v_l + u_l v_h + u_l v_l).$

The subroutine computes $u_h v_h$ exactly in extended precision and adds the remaining terms approximately. Each addition of a single-precision number introduces at most one rounding error of magnitude at most $|x|/2^p$. Counting the assignments: one multiplication $u_h v_h$, three products for the low parts, and up to three additions, gives at most six rounding steps, each contributing $\le 2^{-p}$ relative error in the term being added. Summing geometrically and using $|u v| \ge |u_h v_h| - |u_h v_l + u_l v_h + u_l v_l|$, we obtain

$|\delta_2| \le 6 \cdot 2^{-p}.$

Therefore, the relative error for double-precision multiplication is bounded by

$\boxed{|\delta_2| \le 6 \cdot 2^{-p}}.$

Division

Let $u \ominus v$ denote the computed quotient. Following the argument in Section 4.2.4, division can be implemented as the Newton-Raphson iteration for $1/v$ followed by multiplication by $u$. Each step incurs at most one rounding error in the final multiplication. Let $x = 1/v$, $y = u x$ computed in double precision. The error in $x$ satisfies $|x - 1/v| \le 2^{-p} |1/v|$ and the final multiplication contributes an additional rounding error of at most $2^{-p}|y|$. Hence the total relative error satisfies

$|\delta_3| \le 2^{-p} + 2^{-p} + O(2^{-2p}) \le 2 \cdot 2^{-p} = 2^{-(p-1)}.$

This gives the bound for the double-precision quotient.

Summary

The bounds on relative errors for the double-precision subroutines of Section 4.2.4 are therefore

$$ \boxed{ |\delta_1| \le 2^{-(p+1)}, \quad |\delta_2| \le 6 \cdot 2^{-p}, \quad |\delta_3| \le 2^{-(p-1)} }. $$

This completes the solution.

∎