Personal experience (not) implementing IEEE Superstandard Underflow

Sun Jan 7 10:08:10 PST 1990

David Hough is right when he says that hardware implementers became confused
by IEEE 754's allowance for detecting underflow in several different ways.

On the floating-point engine used in the 80387 and 80960KB, I chose to
detect underflow after rounding by inexact result.  At the time, I thought
that detecting underflow by an inexact result was the numerically superior
approach.  Only later, when I realized that I couldn't explain to someone
exactly how denormalization loss is detected, did I ask around and find an
explanation of denormalization loss that I understood.  It was then that I
realized that I had not implemented the best underflow detection mechanism.
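
For readers who want the two loss-of-accuracy tests side by side, here is a
rough sketch in Python.  It is not how any hardware does it.  It follows the
IEEE 754-1985 wording: "inexact result" compares the delivered result with
the exact one, while "denormalization loss" compares it with what rounding
would have delivered had the exponent range been unbounded.  The toy format
(P, EMIN) and all function names are assumptions made up for illustration,
and only positive tiny values with round-to-nearest-even are handled.

```python
from fractions import Fraction
import math

P = 4        # toy precision in bits (assumption, not a real format)
EMIN = -3    # toy minimum normal exponent (assumption)

def round_to_grid(x, ulp):
    """Round x to the nearest multiple of ulp, ties to even."""
    q = x / ulp
    n = math.floor(q)
    rem = q - n
    if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and n % 2 == 1):
        n += 1
    return n * ulp

def exponent_of(x):
    """Return floor(log2(x)) for a positive Fraction x."""
    e = 0
    while Fraction(2) ** (e + 1) <= x:
        e += 1
    while Fraction(2) ** e > x:
        e -= 1
    return e

def loss_flags(x):
    """Return (inexact, denormalization_loss) for an exact positive x."""
    e = exponent_of(x)
    # Rounding to P bits as if the exponent range were unbounded.
    unbounded = round_to_grid(x, Fraction(2) ** (e - P + 1))
    # Rounding to what is actually deliverable: the grid spacing stops
    # shrinking once the exponent falls below EMIN.
    delivered = round_to_grid(x, Fraction(2) ** (max(e, EMIN) - P + 1))
    return delivered != x, delivered != unbounded
```

With these toy parameters, loss_flags(Fraction(65, 4096)) reports an inexact
result but no denormalization loss, because both roundings deliver 1/64,
while loss_flags(Fraction(3, 128)) reports both.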

So, at a *very* late date before committing the implementation to silicon,
I spent a day or two trying to detect denormalization loss instead of inexact
result.  But I couldn't figure out how to detect this condition in anything
close to the limited available microcode ROM space, and since inexact result
was allowed by the standard, we stayed with it.

Now, David Hough suggests a particular method for detecting underflow by
denormalization loss, and asks if there is convincing evidence that it
really is difficult to detect in a hardware implementation of IEEE 754.

My answer is that it would be very difficult in our implementation.
I wasn't able to do it in microcode.  The correct way to detect this would
have been in the hardware datapaths, with a little more detection logic to
look at the bits being shifted out on a right shift.  We would have needed
to add a method of detecting all 1's being shifted out, and the various
control signals to feed that back into the rounding.

Let me be more explicit here.  We had a shift array that shifted 68 bits
by 0-16 places in one cycle.  For shifts larger than 16 bits, the sticky bit
was preserved across operations by explicitly keeping it as the least
significant bit in the mantissa.  That is, we or'd the 0-16 bits being shifted
off into the least significant bit.  Or'ing (all 0's detect) on this bus was
easy, but and'ing (all 1's detect) was slow because it required additional logic.
Furthermore, we would need some extra control signals to set an all-1s/all-0s
latch at the start of a shift, and then modify it as part of the normal
sticky right shift operation, and then use the final value of the latch as
part of the rounding control logic.
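
A software model of the above may make it easier to follow.  This is only a
sketch: the 68-bit width and the 0-16 step come from the description, but
the function names are invented here, and the way the latches would feed the
rounding logic is not shown.  Note that the latches have to look at the raw
bits leaving the shifter, which is why the second function does not fold a
sticky bit into the mantissa.

```python
WIDTH = 68        # datapath width, from the description above
MAX_STEP = 16     # largest shift the array performs in one cycle

def sticky_shift_step(mant, n):
    """One cycle: shift right by n (0..16), OR the lost bits into the LSB."""
    assert 0 <= n <= MAX_STEP
    lost = mant & ((1 << n) - 1)
    mant >>= n
    if lost:
        mant |= 1          # sticky bit lives in the least significant bit
    return mant

def sticky_shift(mant, count):
    """Multi-cycle sticky right shift, as actually implemented."""
    while count > 0:
        n = min(count, MAX_STEP)
        mant = sticky_shift_step(mant, n)
        count -= n
    return mant

def shift_with_latches(mant, count):
    """Hypothetical variant: plain right shift plus all-0s/all-1s latches.

    Both latches are set at the start of the shift and cleared as soon as
    a cycle shifts out bits that contradict them.  A zero-length shift
    leaves both set (nothing was shifted out).
    """
    all_zeros = True
    all_ones = True
    while count > 0:
        n = min(count, MAX_STEP)
        mask = (1 << n) - 1
        lost = mant & mask
        all_zeros = all_zeros and lost == 0      # the easy OR-style detect
        all_ones = all_ones and lost == mask     # the slow AND-style detect
        mant >>= n
        count -= n
    return mant, all_zeros, all_ones
```

For a 20-place shift, both functions take two passes (16 places, then 4),
and the latch values carry across the passes just as the sticky bit does.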

I do not know for certain that we could not have made this work if we spent
enough time at the circuit design and planned for the extra silicon space
this would require.  However, I think it is extremely unlikely because we
did have a similar case that was much more important to us: detecting whether
an overflow or underflow might occur by checking for exponent overflow when
converting the internal format to the external representation.  Unlike
detecting denormalization loss, improving the speed of this operation would
measurably improve the performance of most floating-point operations.

We detected exponent overflow and underflow by looking for particular
sequences of consecutive 0's and 1's in the internal exponent register.
In our final implementation, we required two cycles to detect exponent
overflow and underflow.  The first cycle looked for overflow, and complemented
the exponent.  The second cycle looked for underflow using the same detection
hardware on the complemented exponent.  We tried for a few weeks to get this
down to a single cycle, but in the end could not get the circuit to be fast
enough.
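
To illustrate the idea of reusing one detector on the complemented exponent,
here is a toy model in Python.  It is not the actual circuit or its bit
patterns.  It assumes a 17-bit two's-complement internal exponent and a
symmetric 15-bit signed external range, both chosen only to keep the example
small; real IEEE exponent ranges are biased and not symmetric.  All names
are hypothetical.

```python
INTERNAL_BITS = 17    # assumed width of the internal exponent register
EXTERNAL_BITS = 15    # assumed width of the external exponent range
INTERNAL_MASK = (1 << INTERNAL_BITS) - 1

def too_large(exp_bits):
    """Shared detector: sign bit is 0 and some bit above the external
    range is 1, i.e. the value is positive and does not fit."""
    sign = (exp_bits >> (INTERNAL_BITS - 1)) & 1
    upper_mask = (1 << (INTERNAL_BITS - EXTERNAL_BITS)) - 1
    upper = (exp_bits >> (EXTERNAL_BITS - 1)) & upper_mask
    return sign == 0 and upper != 0

def check_exponent(exp):
    """Return (overflow, underflow) for a signed internal exponent."""
    bits = exp & INTERNAL_MASK              # two's-complement encoding
    # Cycle 1: look for overflow and complement the exponent.
    overflow = too_large(bits)
    complemented = ~bits & INTERNAL_MASK
    # Cycle 2: the same detector on the complement finds underflow,
    # because x < -2**14 exactly when ~x > 2**14 - 1.
    underflow = too_large(complemented)
    return overflow, underflow
```

In this model check_exponent(16384) flags overflow and check_exponent(-16385)
flags underflow, while the values just inside the range flag neither.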

I don't know if this analysis qualifies as "significant expense in an IEEE
system implemented in hardware."  Perhaps some other implementers can
speak up here and describe their experience.
-- 
Jim Valerio	jimv@pzbaum.com, {uunet,omepd,reed}!pzbaum!jimv