TAOCP 4.2.2 Exercise 31

Section 4.2.2: Accuracy of Floating Point Arithmetic

Exercise 31. [M25] (U. Kulisch.) When the polynomial $2y^2 + 9x^2 - y^4$ is evaluated for $x = .408855776$ and $y = .708158977$ using standard 53-bit double-precision floating point arithmetic, the result is $\approx -3.7 \times 10^{10}$. Evaluating it in the alternative form $2y^2 + (3x^2 - y^2)(3x^2 + y^2)$ gives $\approx +1.0 \times 10^{10}$. The true answer, however, is 1.0 (exactly). Explain how to construct similar examples of numerical instability.

Verified: no
Solve time: 9m51s

Solution

The point of Kulisch's example is not merely that a subtraction of nearly equal quantities occurs. The numbers are chosen so that the exact value of the polynomial is very small, while the floating-point evaluation of the constituent terms is grossly inaccurate. The small exact value results from an algebraic identity that holds over the real numbers but is destroyed by rounding.

A systematic way to construct such examples is as follows.

Let $a$ and $b$ be floating-point numbers. Compute their squares exactly and write

$$ a^2=A+\alpha,\qquad b^2=B+\beta, $$

where $A=\operatorname{fl}(a^2)$, $B=\operatorname{fl}(b^2)$, and $\alpha,\beta$ are the rounding errors.

Choose $a$ and $b$ so that $\alpha$ and $\beta$ are as large as possible, namely about one-half ulp of the corresponding squares.

Now consider the polynomial

$$ P(x,y)=x+y-(A+B). $$

At the point

$$ x=a^2,\qquad y=b^2, $$

its exact value is

$$ P(a^2,b^2) =(A+\alpha)+(B+\beta)-(A+B) =\alpha+\beta. $$

If the rounding errors are chosen so that

$$ \alpha+\beta=1, $$

then the exact value of the polynomial is exactly $1$.

However, when $P$ is evaluated in floating-point arithmetic using the values $x=a^2$ and $y=b^2$, the quantities $x$ and $y$ are first rounded to $A$ and $B$. Thus the machine effectively computes

$$ A+B-(A+B)=0. $$

The entire quantity $1=\alpha+\beta$ has disappeared. The polynomial has exact value $1$, but floating-point evaluation gives $0$.

This already exhibits the basic mechanism.

To obtain a more dramatic example, one arranges for the rounding errors to be very large integers. This is possible because the spacing between adjacent floating-point numbers grows with magnitude. If $a^2$ and $b^2$ are around $2^{m}$, then the rounding errors can be of size approximately $2^{m-53}$. By taking $m$ large, one can make $\alpha$ and $\beta$ enormous while still prescribing

$$ \alpha+\beta=1. $$

Then the exact polynomial value remains $1$, but the individual rounded terms are wrong by quantities of order $2^{m-53}$.

The next step is to encode the same identity into different algebraically equivalent forms. For example,

$$ x^2-y^2=(x-y)(x+y). $$

Suppose $x$ and $y$ are chosen so that the exact difference $x^2-y^2$ is small, while the separately computed quantities $x^2$ and $y^2$ each contain large rounding errors. Then

$$ (x^2-y^2)+C $$

and

$$ (x-y)(x+y)+C $$

are identical over the reals, but floating-point evaluation propagates the rounding errors differently. One form may produce a huge negative result, another a huge positive result, although the exact value is $1$.

This is exactly the pattern seen in Kulisch's example. The polynomial

$$ 2y^2+9x^2-y^4 $$

can be rewritten as

$$ 2y^2+(3x^2-y^2)(3x^2+y^2). $$

The values of $x$ and $y$ were selected so that the exact result is

$$ 1, $$

while the quantities $3x^2$ and $y^2$ are represented with large rounding errors. In one evaluation scheme these errors combine in one way, in the other scheme they combine differently. Since the exact answer arises from a delicate cancellation of large quantities, the two floating-point computations can produce enormous numbers of opposite signs even though the true value is $1$.

Thus the general construction is:

Choose floating-point numbers whose squares or other powers incur large rounding errors.
Arrange an algebraic identity so that those rounding errors cancel exactly in real arithmetic, leaving a small prescribed value, such as $1$.
Express the identity in two or more algebraically equivalent forms.
Evaluate the forms in floating-point arithmetic. Because rounding occurs at different places, the cancellation is no longer exact, and the various forms can yield wildly different answers.

This procedure systematically generates examples of the same type as Kulisch's, with exact value $1$ but arbitrarily large floating-point errors. ∎