TAOCP 4.2.3 Exercise 3

Section 4.2.3: Double-Precision Calculations

Exercise 3. [M20] Explain why overflow cannot occur during Program M.

Verified: no
Solve time: 1m14s

Solution

Let

$$ u=(u_m+eu_l)b^{e_u}, \qquad v=(v_m+ev_l)b^{e_v}, $$

where $u_m$ and $v_m$ are the most-significant halves of the fractions, $u_l$ and $v_l$ are the least-significant halves, and

$$ e=b^{-5}. $$

Program M forms the product

$$ uv=(u_m+eu_l)(v_m+ev_l)b^{e_u+e_v} $$

by computing only those partial products that contribute to the leftmost eight fraction bytes. According to Fig. 4, the retained part of the product is obtained from

$$ u_m v_m,\qquad e,u_m v_l,\qquad e,u_l v_m, $$

while the term

$$ e^2u_l v_l $$

is discarded.

Each normalized half-fraction satisfies

$$ 0\le u_m<1,\qquad 0\le u_l<1, $$

and similarly for $v_m$ and $v_l$. Hence every partial product computed by Program M is strictly less than $1$.

The largest contribution comes from $u_m v_m$. Since both factors are less than $1$,

$$ u_m v_m<1. $$

The remaining retained contributions are multiplied by the factor $e=b^{-5}$, therefore

$$ e,u_m v_l<e, \qquad e,u_l v_m<e. $$

Consequently the quantity accumulated by the program before normalization is bounded by

$$ u_m v_m+e,u_m v_l+e,u_l v_m <1+e+e =1+2e. $$

Because $b\ge2$,

$$ e=b^{-5}\le\frac1{32}, $$

$$ 1+2e \le 1+\frac1{16} <2. $$

Thus the unnormalized fraction produced by the multiplication is always strictly less than $2$.

Overflow during the arithmetic operations of Program M would require a quantity of magnitude at least $2$ to arise in the accumulator. The bound above shows that this never happens. At worst, the product lies in the interval $[1,2)$ and requires a single right shift during normalization.

The exponent is adjusted only after the fraction has been formed and normalized. Exponent overflow is tested separately by the program; the exercise asks why overflow cannot occur during the multiplication itself. The reason is that the retained fraction is always less than $2$, so every intermediate partial sum fits within the available fraction field.

This completes the proof.

∎