Satan loves extended precision

uunet!cwi.nl!Dik.Winter
Thu Jan 19 18:14:01 PST 1995


 > >>     Assume |a| >= |b| (if not, swap a and b)
 > >>     x := a + b
 > >>     e := x - a
 > >>     y := b - e
(Note: with e := x - a as above, the quoted y := b - e is the error term
for which a+b = x+y holds exactly; writing
          y := e - b
instead merely flips its sign, as the worked example below shows.)
 > >> 
 > >> This simple algorithm is guaranteed to produce a y such that a+b = x+y
 > >> _exactly_.

Indeed, Dekker states the requirements for this to work: what he
called faithful arithmetic, i.e. round to nearest, but if two representable
numbers are equally close either one is acceptable.  If your x, e and y are
double and internal arithmetic is long double it is easy to construct an
example where double roundings yield the wrong inexact result.  For instance
a = 2^53 + 2, b = 1 - 2^-53.  The correct result is of course x = a, y = b.
But the very first operation makes x = 2^53 + 4 if double rounding (first
to the 80-bit format, then to the 64-bit format) is performed: rounding to
80 bits makes the sum 2^53 + 3, and rounding that to 64 bits, with its
53-bit mantissa, rounds up because of the round-to-even rule.  Now y must
be b - 2 = -1 - 2^-53, but that is not representable
as a double.

However, Dekker starts with a faithful arithmetic and ends up with a
non-faithful doubled-precision arithmetic.  For *that* purpose the result
is good enough.

Others have commented about what happens in the case of overflow and
denormals if this was done in extended precision rounded to double.
Note however that Dekker's routines do not fare well if overflow
occurs.  With his routines (see the exact multiplication of two reals %)
overflow occurs earlier than would normally be the case.  To use it you
sometimes need to scale the numbers.  Denormals are of course required
to make it work near underflow.  Dekker did not think about *that* because
the machine we used at that time (Electrologica X8) had denormals.

 > Yes, this will work fine as long as the values stored in x and e are
 > the same as those loaded into the subsequent expressions.  (You
 > can force this by declaring x and e volatile or by turning off certain
 > floating point optimizations.)

Declaring volatile is, as I hear, not sufficient; putting a cast in the
proper place is:
    x = a + b
    e = (double)x - a
    y = e - b
works as per ANSI C.  Due to the warts of ANSI C in this field the
implicit cast in the assignment to x is not sufficient, even if it
is made explicit.  No provision is needed for 'e' because its value
is representable as a double.
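Where a compiler honors neither the cast nor the implicit conversion, the
same effect can be sketched by forcing every intermediate through a
double-width memory store (helper name is mine; assumes the stored doubles
are singly rounded):

```c
#include <assert.h>
#include <math.h>

/* Heavy-handed alternative to the cast: volatile forces each
   intermediate to be stored to (and reloaded from) a double.
   Error-term sign conventions vary; this returns y = b - e,
   for which a + b == x + y exactly. */
static double sum_err(double a, double b)
{
    volatile double x = a + b;  /* store rounds to double */
    volatile double e = x - a;  /* reload x, store e      */
    return b - e;
}
```

For example, sum_err(1.0, 2^-60) returns 2^-60: the tiny addend is rounded
away from x entirely and reappears as the error term.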

 >                                 It won't matter that on the Intel chip
 > rounding might have been to 80 bits and then 64 bits.

It does: as outlined above, the first operation can already give a result
bad enough that exact representation of the error term is not possible.

 > A cleaner solution is to just accumulate the sum in long double precision.
 > Then you don't need Dekker's trick.

Dekker's trick is to double the precision, not to extend it by a fraction.
=========================
Other random comments and a note at the end:

Someone suggested the following sequence:
1.  Do addition with proper rounding.
2.  If inexact is set:
2a.    Redo addition with round to zero.
2b.    If last bit of result is zero, set it to one.
3.  Round to double
Better would be:
1.  Do addition with round to zero.
2.  If inexact is set:
2a.    Set last bit of result to one.
3.  Round to double

Further, as I have established in discussions on the net with Vaughan Pratt,
the minimal precision you need to get proper results even after double rounding
is a 2n+1 bit mantissa for intermediate results, when the desired result
has n bits mantissa.  With my example above the first operation will be
correct if the number of bits for the intermediate mantissa is 107.
A corollary of this is that both double and extended are good enough to
get proper IEEE handling of single precision.

And finally the note:
%  I have for the next month no access to my references (they are stored
   because the building is being renovated and I am forced to work at
   home), but I think the exact multiplication is as follows.  Multiply
   x by 2^n + 1, where n is one half of the number of bits in the
   mantissa of x (rounded down if odd).  Store and divide by the same
   number; this will give you a number that has the first n mantissa
   bits of x, rounded up if needed.  Subtract from x and you get the
   remainder (which may be of different sign).   The same is done with y,
   and doing some exact multiplications will yield the exact representation
   of the product.

And more finally another note: there was a paper a few years ago (which I
do not have access to currently) with a slight improvement to the algorithm
so that the precision could be redoubled.

dik [work phone does not go through to me; I get e-mail that I have been called]
--
dik t. winter, cwi, kruislaan 413, 1098 sj  amsterdam, nederland, +31205924098
home: bovenover 215, 1025 jn  amsterdam, nederland; e-mail: dik@cwi.nl



More information about the Numeric-interest mailing list