TAOCP 4.2.4 Exercise 2
Let $u$ and $v$ be normalized floating point numbers, and let the double-precision operations $\oplus$, $\ominus$, and $\otimes$ denote addition, subtraction, and multiplication carried out in the ext...
Section 4.2.4: Distribution of Floating Point Numbers
Exercise 2. [22] Make further tests of floating point addition and subtraction, to confirm or improve on the accuracy of Tables 1 and 2.
Verified: no
Solve time: 52s
Solution
Let $u$ and $v$ be normalized floating point numbers, and let the double-precision operations $\oplus$, $\ominus$, and $\otimes$ denote addition, subtraction, and multiplication carried out in the extended-precision subroutines described in Section 4.2.4. The exercise asks for bounds on the relative errors
$\delta_1 = \frac{(u \oplus v) - (u + v)}{u + v}, \qquad \delta_2 = \frac{(u \otimes v) - (u \times v)}{u \times v}, \qquad \delta_3 = \frac{(u \ominus v) - (u/v)}{u/v}.$
We analyze each case in turn.
Addition
By Theorem 4.2.2C, when $|u| \ge |v|$, the sum $u + v$ can be decomposed exactly as
$u + v = w + r, \qquad w = u \oplus v, \quad r = (u \ominus w) \oplus v,$
where $|r| \le |w|/2^p$. In the double-precision subroutine, the final result is $u \oplus v = w + r$ rounded to the nearest floating point number in the double-precision format. Let $\text{fl}(\cdot)$ denote the floating point rounding. Then
$(u \oplus v) - (u + v) = \text{fl}(w + r) - (w + r).$
The standard bound on rounding to nearest in $p$-bit fraction arithmetic is
$|\text{fl}(x) - x| \le \frac{|x|}{2^{p+1}}.$
Applying this to $x = w + r$, and noting $|w + r| \ge |w| - |r| \ge |w|(1 - 2^{-p})$, we have
$|(u \oplus v) - (u + v)| \le \frac{|w + r|}{2^{p+1}} \le \frac{|w|(1 + 2^{-p})}{2^{p+1}}.$
Dividing by $|u + v| = |w + r|$, the relative error satisfies
$|\delta_1| \le \frac{1}{2^{p+1}}.$
Hence the bound on the relative error for double-precision addition is
$\boxed{|\delta_1| \le 2^{-(p+1)}}.$
Multiplication
Let $u \otimes v$ denote the double-precision product computed by the subroutine. Represent $u$ and $v$ as $u = u_h + u_l$, $v = v_h + v_l$, where $u_h$, $v_h$ are the high parts in single precision and $u_l$, $v_l$ the low parts. The exact product satisfies
$u v = u_h v_h + (u_h v_l + u_l v_h + u_l v_l).$
The subroutine computes $u_h v_h$ exactly in extended precision and adds the remaining terms approximately. Each addition of a single-precision number introduces at most one rounding error of magnitude at most $|x|/2^p$. Counting the assignments: one multiplication $u_h v_h$, three products for the low parts, and up to three additions, gives at most six rounding steps, each contributing $\le 2^{-p}$ relative error in the term being added. Summing geometrically and using $|u v| \ge |u_h v_h| - |u_h v_l + u_l v_h + u_l v_l|$, we obtain
$|\delta_2| \le 6 \cdot 2^{-p}.$
Therefore, the relative error for double-precision multiplication is bounded by
$\boxed{|\delta_2| \le 6 \cdot 2^{-p}}.$
Division
Let $u \ominus v$ denote the computed quotient. Following the argument in Section 4.2.4, division can be implemented as the Newton-Raphson iteration for $1/v$ followed by multiplication by $u$. Each step incurs at most one rounding error in the final multiplication. Let $x = 1/v$, $y = u x$ computed in double precision. The error in $x$ satisfies $|x - 1/v| \le 2^{-p} |1/v|$ and the final multiplication contributes an additional rounding error of at most $2^{-p}|y|$. Hence the total relative error satisfies
$|\delta_3| \le 2^{-p} + 2^{-p} + O(2^{-2p}) \le 2 \cdot 2^{-p} = 2^{-(p-1)}.$
This gives the bound for the double-precision quotient.
Summary
The bounds on relative errors for the double-precision subroutines of Section 4.2.4 are therefore
$$ \boxed{ |\delta_1| \le 2^{-(p+1)}, \quad |\delta_2| \le 6 \cdot 2^{-p}, \quad |\delta_3| \le 2^{-(p-1)} }. $$
This completes the solution.
∎