more comments about exceptions

David G Hough at validgh
Mon Dec 6 09:48:54 PST 1999


Received from Joe Darcy:

From: "Joseph D. Darcy" <darcy@CS.Berkeley.EDU>
Subject: Re: questions and comments about gradual underflow
Date: Sun, 5 Dec 1999 23:32:28 -0800 (PST)

>I received the following:
>
>> Recently we have been revisiting the tradeoffs between
>> gradual underflow and "flush to zero". Of course, we
>> haven't been debating which is the computationally better 
>> choice but rather the realities of the market. And we
>> haven't been debating gradual underflow support, only
>> how much resource to dedicate to making it perform well.
>
>> Unfortunately, it seems that those of us doing computer
>> architecture have not done a good job of finding ways
>> to implement gradual underflow without imposing
>> a large performance hit. (It is a very difficult
>> problem in high-speed superscalar pipelines, where a
>> potential underflow may need to cause a pipeline stall,
>> flush, etc.)

[snip]

>I always thought of underflow to subnormal as like a page fault,
>relatively rare but not so rare that performance can be completely ignored.

[snip]

>If you notice the performance impact anyway on a particular program,
>then by a compile-time option or run-time function call you can enable
>a bit in the SPARC %fsr register for nonstandard mode, which causes
>subnormal operands and results to be interpreted as zeros in hardware,
>with no traps to software.
>Being nonstandard, the exact definition of this mode 
>varies from system to system; that's why it's called nonstandard.
>On SuperSPARC and microSPARC systems, which handled subnormal operands
>and results completely in hardware, the nonstandard bit was a no-op.
>
>As for ISV's: as with other compile-time optimization switches, they
>investigate nonstandard mode only after they encounter a performance problem.

As a general comment, people often seem more concerned with
performance than correctness.  If you don't care what is computed, why
do you care how fast it is computed?

>Most never do.  Certainly nobody should enable nonstandard mode
>for portability, although ISV's that have ported applications from
>VAX, IBM mainframes, and Crays have no common expectations about exception
>handling and so exploit nothing specific to IEEE 754 hardware.
>
>As for the future, there seems to be oscillation between putting as little as
>possible in hardware, to maximize clock rate, vs. doing it all
>in hardware, to minimize performance anomalies.  The extreme case here
>has been DEC alpha systems, which as far as I understand, 
>can't provide correct IEEE exception handling without significantly 
>slowing down programs even when no exceptions are encountered.

A few years ago for a class project, a partner and I looked into the
performance of arithmetic operations on subnormal and non-finite IEEE
values on a variety of architectures.  Early iterations of the Alpha
architecture (we looked at a 21164) had to run in a degraded mode if
there was even the possibility of generating a NaN or infinity, to say
nothing of subnormals.  Due to the way the architecture is designed,
on a given implementation it may be necessary to have a trap barrier
instruction in each basic block with a floating point instruction.
(If something exceptional actually does occur, the trap handler has to
trace backwards through the instruction stream to figure out which
instruction caused the trap and see what registers hold the initial
operands.)  When given the "use true IEEE 754 semantics" flag, the DEC
compiler we used for the project inserted a trap barrier after *each
floating point operation*.  Consequently, floating point pipelining
was eliminated and performance suffered accordingly.  In our
measurements, allowing for IEEE special values resulted in a 60% drop
in SPEC95 performance (7.05 to 2.87).  IIRC, the next rev of DEC's
compiler did use only one trap barrier per basic block, which should
reduce the penalty in the non-exceptional case; but we didn't make any
measurements using that compiler.  Moving along the wheel of
progress, I believe more recent Alpha chips have added hardware
support for NaNs and infinities.

Overall, we found uneven support for all IEEE 754 special values.
PA-RISC did NaN in hardware but not infinity.  UltraSPARC did NaN and
infinity in hardware but subnormals were extremely slow, greater than
10,000 cycles for a single operation.  In contrast, other processor/OS
combinations took "only" about 1000 cycles when they had to trap for a
subnormal or other special value.

While subnormals can require some non-trivial resources, infinity and
NaN support basically just need an extra mux at the bottom of the fp
pipe.  The fpu already has to detect these special values; it just
needs a small amount of additional hardware to make these operations
go fast.

If anyone is interested, the compressed PostScript of the project
report is online at

http://www.cs.berkeley.edu/~darcy/Research/fleckmrk.ps.gz

-Joe Darcy
darcy@cs.berkeley.edu



