TAOCP 4.2.2 Exercise 19

Let t_k=s_k-c_k .

Section 4.2.2: Accuracy of Floating Point Arithmetic

Exercise 19. ▶ [M30] (W. M. Kahan.) Consider the following procedure for floating point summation of $x_1, x_2, \ldots, x_n$:

$$s_0 = c_0 = 0;$$

$$y_k = x_k \ominus c_{k-1}, \qquad s_k = s_{k-1} \oplus y_k, \qquad c_k = (s_k \ominus s_{k-1}) \ominus y_k, \qquad \text{for } 1 \le k \le n.$$

Let the relative errors in these operations be defined by the equations

$$y_k = (x_k - c_{k-1})(1 + \mu_k), \qquad s_k = (s_{k-1} + y_k)(1 + \sigma_k),$$ $$c_k = ((s_k - s_{k-1})(1 + \gamma_k) - y_k)(1 + \delta_k),$$

where $|\mu_k|, |\sigma_k|, |\gamma_k|, |\delta_k| \le \epsilon$. Prove that $s_n - c_n = \sum_{k=1}^n (1 + \theta_k) x_k$, where $|\theta_k| \le 2\epsilon + O(\epsilon^2)$. [Theorem C says that if $b = 2$ and $|s_{k-1}| \ge |y_k|$ we have $s_{k-1} + y_k = s_k - c_k$ exactly. But in this exercise we want to obtain an estimate that is valid even when floating point operations are not carefully rounded, assuming only that each operation has bounded relative error.]

Verified: no
Solve time: 6m29s


Let

$$ t_k=s_k-c_k . $$

We will prove by induction that $t_k$ and $c_k$ admit coefficient representations

$$ t_k=\sum_{j=1}^k (1+\alpha_{k,j})x_j,\qquad c_k=\sum_{j=1}^k \beta_{k,j}x_j, $$

with

$$ \alpha_{k,j}=O(\epsilon),\qquad \beta_{k,j}=O(\epsilon), $$

and such that

$$ \alpha_{k,j}=\mu_j-\gamma_j+O(\epsilon^2) $$

for every $j\le k$. Taking $k=n$ will then give

$$ s_n-c_n=t_n=\sum_{j=1}^n (1+\theta_j)x_j, \qquad \theta_j=\mu_j-\gamma_j+O(\epsilon^2), $$

hence

$$ |\theta_j| \le |\mu_j|+|\gamma_j|+O(\epsilon^2) \le 2\epsilon+O(\epsilon^2). $$

The essential point is to keep track of coefficients of the individual $x_j$.

1. Expansions for $t_k$ and $c_k$

From

$$ y_k=(x_k-c_{k-1})(1+\mu_k), $$

$$ s_k=(s_{k-1}+y_k)(1+\sigma_k), $$

$$ c_k=\Bigl((s_k-s_{k-1})(1+\gamma_k)-y_k\Bigr)(1+\delta_k), $$

we retain all first-order terms and absorb products of error quantities into $O(\epsilon^2)$.

First,

$$ s_k-s_{k-1} = y_k+\sigma_k(s_{k-1}+y_k). $$

Hence

$$ s_k-s_{k-1}-y_k = \sigma_k(s_{k-1}+y_k). $$

Substituting into the formula for $c_k$,

$$ \begin{aligned} c_k &= \Bigl( s_k-s_{k-1}-y_k +\gamma_k(s_k-s_{k-1}) \Bigr)(1+\delta_k) \ &= \sigma_k(s_{k-1}+y_k) +\gamma_k(s_k-s_{k-1}) +O(\epsilon^2). \end{aligned} $$

Since

$$ s_k-s_{k-1}=y_k+O(\epsilon), $$

we obtain

$$ c_k = \sigma_k(s_{k-1}+y_k)+\gamma_k y_k+O(\epsilon^2). \tag{1} $$

Next,

$$ \begin{aligned} t_k &=s_k-c_k \ &= s_{k-1}+y_k -\gamma_k(s_k-s_{k-1}) +O(\epsilon^2) \ &= s_{k-1}+y_k-\gamma_k y_k+O(\epsilon^2). \end{aligned} $$

Using $s_{k-1}=t_{k-1}+c_{k-1}$,

$$ t_k = t_{k-1} +c_{k-1} +y_k(1-\gamma_k) +O(\epsilon^2). $$

Since

$$ y_k=(x_k-c_{k-1})(1+\mu_k), $$

$$ \begin{aligned} t_k &= t_{k-1} +c_{k-1} +(x_k-c_{k-1})(1+\mu_k-\gamma_k) +O(\epsilon^2) \ &= t_{k-1} +(1+\mu_k-\gamma_k)x_k -(\mu_k-\gamma_k)c_{k-1} +O(\epsilon^2). \end{aligned} \tag{2} $$

Equation (2) is the key recurrence.

2. Induction hypothesis

Assume that for some $k-1\ge0$,

$$ t_{k-1} = \sum_{j=1}^{k-1}(1+\alpha_{k-1,j})x_j, \tag{H1} $$

$$ c_{k-1} = \sum_{j=1}^{k-1}\beta_{k-1,j}x_j, \tag{H2} $$

where

$$ \alpha_{k-1,j} = \mu_j-\gamma_j+O(\epsilon^2), \tag{H3} $$

$$ \beta_{k-1,j}=O(\epsilon). \tag{H4} $$

The basis $k=0$ is trivial since $t_0=c_0=0$.

We show that the same properties hold at step $k$.

3. Control of the $c_k$-coefficients

From (1),

$$ c_k = \sigma_k(s_{k-1}+y_k)+\gamma_k y_k+O(\epsilon^2). $$

Since

$$ s_{k-1}=t_{k-1}+c_{k-1}, $$

and by (H1), (H2),

$$ s_{k-1} = \sum_{j=1}^{k-1} \bigl(1+\alpha_{k-1,j}+\beta_{k-1,j}\bigr)x_j . $$

Also,

$$ y_k=(x_k-c_{k-1})(1+\mu_k) = x_k-\sum_{j=1}^{k-1}\beta_{k-1,j}x_j +O(\epsilon). $$

Therefore $s_{k-1}+y_k$ and $y_k$ are linear combinations of the $x_j$ whose coefficients are $O(1)$.

Multiplying by $\sigma_k$ or $\gamma_k$, each coefficient becomes $O(\epsilon)$. Hence

$$ c_k = \sum_{j=1}^{k}\beta_{k,j}x_j, \qquad \beta_{k,j}=O(\epsilon). \tag{3} $$

This establishes the required coefficientwise bound for $c_k$.

4. Control of the $t_k$-coefficients

Insert (H2) into (2):

$$ t_k = t_{k-1} +(1+\mu_k-\gamma_k)x_k -\sum_{j=1}^{k-1} (\mu_k-\gamma_k)\beta_{k-1,j}x_j +R_k, $$

where $R_k$ is a linear combination of the $x_j$ whose coefficients are $O(\epsilon^2)$.

The existence of such a representation follows from (H1), (H2): every omitted term contains at least two error factors, and all coefficients occurring in $t_{k-1}$ and $c_{k-1}$ are $O(1)$ or $O(\epsilon)$. Therefore each coefficient contributed by $R_k$ is $O(\epsilon^2)$, uniformly in $k$.

Using (H1),

$$ \begin{aligned} t_k &= \sum_{j=1}^{k-1} \Bigl( 1+\alpha_{k-1,j} -(\mu_k-\gamma_k)\beta_{k-1,j} +O(\epsilon^2) \Bigr)x_j \ &\qquad + (1+\mu_k-\gamma_k+O(\epsilon^2))x_k . \end{aligned} $$

Since $\beta_{k-1,j}=O(\epsilon)$,

$$ (\mu_k-\gamma_k)\beta_{k-1,j}=O(\epsilon^2). $$

Hence for $j<k$,

$$ \alpha_{k,j} = \alpha_{k-1,j}+O(\epsilon^2), $$

while

$$ \alpha_{k,k} = \mu_k-\gamma_k+O(\epsilon^2). $$

Using (H3),

$$ \alpha_{k,j} = \mu_j-\gamma_j+O(\epsilon^2) \qquad (1\le j\le k). \tag{4} $$

Thus

$$ t_k = \sum_{j=1}^{k}(1+\alpha_{k,j})x_j, $$

with (4). The induction is complete.

5. Conclusion

Taking $k=n$,

$$ t_n = \sum_{j=1}^{n} \Bigl( 1+\mu_j-\gamma_j+O(\epsilon^2) \Bigr)x_j . $$

Define

$$ \theta_j=\mu_j-\gamma_j+O(\epsilon^2). $$

Then

$$ s_n-c_n=t_n = \sum_{j=1}^{n}(1+\theta_j)x_j . $$

Finally,

$$ |\theta_j| \le |\mu_j|+|\gamma_j|+O(\epsilon^2) \le 2\epsilon+O(\epsilon^2). $$

Therefore

$$ \boxed{ s_n-c_n = \sum_{j=1}^{n}(1+\theta_j)x_j, \qquad |\theta_j| \le 2\epsilon+O(\epsilon^2). } $$

This is the required result. ∎