X86 and IEEE double (or single) arith

Eitan Benny uunet!iil.intel.com!benny
Thu Jan 19 20:45:55 PST 1995


Jonathan Shewchuk writes:
> On chips like the 486DX that have internal extended precision registers,
> is it possible to calculate the single or double precision result of a
> single operation by performing the operation in extended precision, then
> storing the result to memory?  Rounding a result to 80 bits, then to
> 53 bits can produce a different answer than rounding directly to 53 bits.
> However, it is possible for a chip that stores 80-bit extended precision
> numbers internally to save a little extra state to ensure that when a
> number is stored to memory, it is rounded just as it would have if it had
> originally been rounded to 53 bits, thereby complying with the IEEE 754
> standard.  Does the 486DX do this?  If not, is there any way to correctly
> simulate 53-bit rounding - preferably using portable C code?  Does the
> Pentium (or other chips) have this flaw as well?
To elaborate on Gidon Yuval's posting:
By setting the Precision Control bits in the Control Word to 53-bit precision,
the rounding of IEEE operations will be done with 53 bits of mantissa precision
but 15 bits of exponent precision. If you 'store double' into memory right
after it, the exponent will be converted to 11 bits: you'll get double
arithmetic emulation for most cases. The exception is the case where you
should have gotten a denormal result; there a double rounding can occur. To
handle it you can unmask the denormal/underflow exception and handle that case
yourself in a trap handler. There is an 'add 1' bit in the Status Word that
records the rounding direction of the previous operation. As far as I know,
commercial compilers don't support IEEE compliance to this extent, but for the
normal mode they do. E.g. try the Microsoft MSVC flag -Op.
 
> Here's a brief explanation of why I'm concerned:  I have an application in
> which I need to calculate a sum, and then calculate the exact roundoff
> error associated with that sum.  Given an FPU with exact rounding (the
> default configuration for an IEEE 754 conformant FPU), a well-known
> procedure due to Dekker can produce the desired results:
> 
>     Assume |a| >= |b| (if not, swap a and b)
>     x := a + b
>     e := x - a
>     y := b - e
> 
> This simple algorithm is guaranteed to produce a y such that a+b = x+y
> _exactly_.  (In other words, y is the roundoff error associated with the
> double-precision floating-point statement "x := a + b".)  Furthermore,
> x and y are both double-precision values, and y is no greater than half
> an ulp of x.  I might try to make this procedure work on a 486DX in C by
> declaring x to be a "volatile" variable, so that the result of the addition
> is stored to memory and then reloaded before execution of the first
> subtraction.  However, the procedure can fail if the addition is computed
> by first rounding the result to 80 bits of precision, and then to 53 bits.
For this case: if the exponent of a is close to the denormal range and b is
denormal, you'll be in the problematic zone.
 
Benny Eitan
- I speak for myself, etc...
