questions and comments about gradual underflow

Vaughan Pratt pratt at cs.stanford.edu
Sun Dec 5 15:40:47 PST 1999


From: David G Hough at validgh <validgh at validgh.com>
>I received the following:
>> Recently we have been revisiting the tradeoffs between
>> gradual underflow and "flush to zero". [...]
>> Unfortunately, it seems that those of us doing computer
>> architecture have not done a good job of finding ways
>> to implement gradual underflow without imposing
>> a large performance hit. [...]

Whether by coincidence or not, this came up last month on comp.sys.intel,
where a user was complaining about floating point computations that
were running much faster on the AMD K-7 (or K-6?) than on the Pentium,
diagnosed as due to the overhead of handling denormals on the latter.

Why the AMD was not incurring the same hit was not clear, and although
I was able to observe substantial denormalization overhead on a couple
of Ultrasparcs I was unable to duplicate the effect on either a Pentium
or a K-7.

Here's what I found with the following code.

/* Perform 2 million denormalized floating point subtractions */

#include <stdio.h>

int main(void)
{
    int i;
    double x, y;
    for (x = 1, y = 2, i = 0; x; x /= 2, y /= 2, i++);
    printf("%d %g\n", i, y); /* Sanity check: expect 1075 4.94066e-324 */
    for (x = 2e6 * y; x > 0; x -= y);
    return 0;
}

Compiled with gcc, no optimization.

TIMINGS
                                   user system  elapsed   CPU
200 MHz Ultrasparc  Solaris 2.5.1  5.82  22.68  0:28.61  99.6%
300 MHz Ultrasparc  Solaris 2.6    4.30  14.97  0:19.28  99.9%
450 MHz Pentium-II  Linux RH5.1    2.20   0.00  0:02.23  98.6%
550 MHz K-7 Athlon  Linux RH6.0    0.82   0.00  0:01.80  45.5%

One factor in the difference is that x86 floating point registers are
80 bits.  However, without optimization each intermediate result is
stored back to a 64-bit memory slot (in effect), so denormalization
sets in at 2^-1023 (just below the smallest normal for the 11-bit
exponent of 64-bit floating point) rather than at 2^-16383 (the 15-bit
exponent of the x86's 80-bit floating point).  This is why all four
machines ended up with i = 1075.

With -O1 through -O4 on the x86, the arithmetic stays in the
registers, so i goes to 16446.  I haven't looked at the code to see
what gcc does with y in the transition to the second loop, but with
-O1 the program takes essentially zero time, indicating that y is
stored to memory between the two loops (a 64-bit store, which flushes
it to 0), while with -O2 through -O4 it again takes a second or two,
which presumably indicates that x = 2e6*y uses the y left behind in
the register rather than the y from main memory.

As an aside, I find this roller-coaster dependence on optimization level
sucky, but as David's correspondent points out, the ISVs have more
important things to worry about than what purists consider right.

Vaughan Pratt



More information about the Numeric-interest mailing list