Extended precision

Douglas Priest uunet!Eng.Sun.COM!Douglas.Priest
Thu Jan 19 14:41:17 PST 1995


Jonathan Shewchuk writes:
>     Assume |a| >= |b| (if not, swap a and b)
>     x := a + b
>     e := x - a
>     y := b - e
> 
> This simple algorithm is guaranteed to produce a y such that a+b = x+y
> _exactly_.  (In other words, y is the roundoff error associated with the
> double-precision floating-point statement "x := a + b".)  Furthermore,
> x and y are both double-precision values, and y is no greater than half
> an ulp of x.  I might try to make this procedure work on a 486DX in C by
> declaring x to be a "volatile" variable, so that the result of the addition
> is stored to memory and then reloaded before execution of the first
> subtraction.  However, the procedure can fail if the addition is computed
> by first rounding the result to 80 bits of precision, and then to 53 bits.

and David G. Hough answers:
> I think you could count on this to always produce x+y == a+b if you could
> count on the compiler forcing storage to rounding precision on explicit
> assignments.    volatile doesn't always seem to accomplish that.

Even if every intermediate result is stored, the above algorithm does not
always produce x and y satisfying x + y = a + b exactly if double-rounding
occurs.  For example, take a = 2^53 + 2, b = 1 - 2^-53.  Then x rounds
first to 2^53 + 3, then to 2^53 + 4 (by the round-ties-to-even rule).
Consequently, e = 2, and y = b - e rounds first to -(1 + 2^-53) (whose
magnitude is indeed the exact difference between x and a + b), then rounds
again to -1; thus x + y = 2^53 + 3 exactly, while a + b = 2^53 + 3 - 2^-53.

Jonathan Shewchuk continues:
> I wonder if it is possible to obtain the desired results on a 486DX,
> especially if I limit myself to C code that is portable across a wide
> variety of processors (which I suspect would remove the option of
> switching the 486DX into some sort of double-precision mode).

Provided the compiler stores and reloads each result which is assigned
to a double precision variable, there are several portable ways to obtain
a sum and error on Intel as well as other IEEE systems.  Here is one:

   if (fabs(a) < fabs(b)) { double t = a; a = b; b = t; }
   x = a + b;
   e = x - a;
   y = b - e;
   if (y + e != b) { x = a; y = b; }

(Note that this algorithm does not ensure that x is the sum of a and b
as it would be computed in correctly rounded double precision; in
particular, y can be a little larger than half an ulp of x, though it
will be smaller than one ulp of x.)

However, I am not aware of any method portable to all IEEE 754
implementations that costs less than the preceding one.

Patrice Roussel adds:
> This is not a flaw. The x86 FP architecture offers more than what IEEE
> does. The reason why overflow/underflow thresholds do not depend on the
> PC value (they are set to double extended), is to allow people to do
> such things as calculating the length of a representable vector without
> any overflow/underflow during the computation of the squares. Everybody
> knows how complex the Berkeley expression for vector length is! With
> a x87 product, you just write:
> 
>                      --  2
> 	|x| = sqrt ( >  x  )
> 	     	     --  i
> 		     i

The x86 architecture is simply one implementation which conforms to the
IEEE 754 standard.  There are others, and one cannot say which is "best"
without knowing something about how a given program depends on the
various features of each.  In your example, for instance, if the goal
is to compute the norm without raising any unwarranted exceptions, we
should leave the precision control set to double extended precision and
keep all the intermediate sums of squares in registers.  On the other
hand, if the goal is to approximate the norm and simultaneously detect
overflow if an intermediate result lies outside the range of double
precision so that data can be rescaled prior to starting some other
computation, then we should store at least the final sum to memory.
(If we were computing an inner product of two different vectors, we
would need to store every intermediate result.)
 
> If you are ready to deal with lower performance and precision, the x87
> permits you to be 100% IEEE compatible. Compilers have switches to force
> this compatibility mode by setting the PC to double (or single) and 
> forcing any arithmetic instruction to store the result to memory.

The x87 is 100% IEEE compatible in its default mode.  Whether in that
mode or with the precision control set to single or double, it can give
results which differ from those of other IEEE 754 implementations, but
the standard clearly allows such variation.  The problem Shewchuk's
example points out is the lack of bindings between IEEE 754 arithmetic
and programming language standards which would enable a programmer who
knows what semantics a program requires to specify them unambiguously.

Douglas M. Priest
SunSoft Floating Point and Performance Group
Sun Microsystems, Inc.
(but only speaking for myself)



More information about the Numeric-interest mailing list