NCEG Exception Handling: Kahan vs. O'Dell

David Hough sun!Eng!David.Hough
Mon Sep 23 18:35:44 PDT 1991


It's probably the least significant, but among the reasons I haven't made much
progress in proposing exception handling for NCEG is a fear on the one hand
of foreclosing valuable facilities such as Kahan envisions, and on the other
hand of imposing requirements that would exact intolerable performance
penalties on high-performance systems when running common unexceptional
programs - the dilemma described by Mike O'Dell in his paper about his 
experiences trying to design a high-performance Unix processor.  He discovered
that traditional Unix mechanisms such as user-mode handling of page faults
had profound implications for high-performance hardware design that many would
consider out of proportion to their intrinsic value.

The basic mechanism that Kahan would like is restartable floating-point
exceptions.   A SIGFPE handler should be able to figure out the PC of
an instruction that failed, obtain the instruction, revise the operands,
and re-execute.  With that mechanism available you can implement anything
in software - slowly perhaps, but still faster than explicitly testing operands
before use, which is slow even when no exception occurs.
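
Purely as a sketch of what that mechanism enables - assuming an SVR4-style
siginfo interface, whose field names and precision vary by system - a
restartable handler might be installed like this:

	#include <signal.h>

	static void fpe_handler(int sig, siginfo_t *sip, void *ucp)
	{
	    /* On a restartable implementation, sip->si_addr would be the
	       address of the FP instruction that trapped.  A real handler
	       would decode that instruction, compute a substitute result,
	       patch it into the destination register via the ucontext, and
	       return, resuming execution just past the faulting instruction. */
	    (void) sig; (void) sip; (void) ucp;
	}

	void install_fpe_handler(void)
	{
	    struct sigaction sa;
	    sa.sa_sigaction = fpe_handler;        /* extended handler form */
	    sigemptyset(&sa.sa_mask);
	    sa.sa_flags = SA_SIGINFO;             /* ask for siginfo_t     */
	    sigaction(SIGFPE, &sa, (struct sigaction *) 0);
	}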

Kahan is willing to accept a general pre-substitution mechanism, but that
requires a fairly complex mode - the value to be substituted - to accompany
each operation, again increasing
the cost of implementing the normal case.  The fatal flaw, I think, is the
requirement that it be possible to change the pre-substituted value on
every iteration of an inner loop.
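
To make that concrete, here is roughly what a continued-fraction-style inner
loop would look like with a hypothetical presubstitute() primitive (the name
and interface are mine, for illustration only):

	for (i = n; i > 0; i--) {
	    /* the value to stand in for an invalid result depends on i,
	       so the presubstitution "mode" must be reloaded every trip */
	    presubstitute(FE_INVALID, q[i]);
	    f = a[i] / (b[i] + f);
	}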

Under the current regime of very fast technological progress, anything which
tends to complexify design, implementation, or testing costs time to market
and hence performance.  Anybody who doesn't think that matters should try
using a 66 MHz HP Snake, if you can find one; what's most impressive to me is
the way ours zips through Makefiles and shell scripts... and the floating
point is pretty fast too.  The Snakes are said to be about 2 years from
initial concept to initial ship, which is pretty good, but a year from now
their performance won't be very noteworthy; a slip of a year due to a complex
design would have been fatal.  I don't think the lesson has been lost on most
hardware designers.

So I don't think an elaborate hardware mechanism to support presubstitution 
stands much chance of implementation soon.

Instead you could imagine three modes for IEEE floating-point exception
handling:

1) nonstop default

2) noncontinuable trap

3) restartable trap

"Noncontinuable trap" is for implementing traditional behaviors in which
unexpected exceptions are to terminate execution or at least provoke a long
jump.  When an exception occurs in that mode, a SIGFPE eventually occurs
for which nothing is guaranteed:  it may not be possible to determine
exactly which instruction provoked the trap,
and whether returning from the trap handler can be done meaningfully
depends on the implementation; on high-performance implementations the
answer is usually no.  Performance in the unexceptional case is the same
for "noncontinuable trap" and "nonstop" modes.

"Restartable trap" supports software implementation of as many exceptional
handling schemes as anyone could desire.  The PC points to the instruction
that generated the trap, or maybe to the next one, so it is reasonable to
pass the PC to the SIGFPE handler and for the SIGFPE handler to act upon it.
In unexceptional cases, "restartable trap" mode
is slower in than the other two on high-performance implementations,
because it inhibits much potential instruction-level parallelism.

I can imagine two ways to obtain "restartable trap" behavior:

1) have three dynamic trapping modes in hardware

2) have two dynamic trapping modes in hardware, 
   plus a synchronization instruction

The three hardware modes nonstop/noncontinuable/restartable allow debugging
executables without recompiling, which is certainly helpful, but require
two mode bits per exception in the FP status register, and may be more
expensive to implement in hardware than the alternative:

Two hardware modes nonstop/trapping don't distinguish noncontinuable from
restartable traps at run time, but require that, if restartable traps are
desired, synchronizing instructions be inserted at compile time
after every FP instruction that might trap.  These synchronizing instructions
block the PC from advancing until all previous instructions have completed
or can guarantee that they won't trap (some MIPS-like cleverness can help
performance on the guarantee).   The synchronizing instructions have no other
effect.
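
At the source level the effect is roughly the following, with __fp_sync()
standing in for the hypothetical synchronizing instruction:

	t = x * y;        /* FP op that might trap */
	__fp_sync();      /* PC may not advance past this point until the
	                     multiply has completed or is known not to trap;
	                     no other effect */
	z = t + w;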

Thus to debug you have to recompile at least the parts of the program you
are interested in.  

My question for hardware implementers is whether "three modes" or "two modes
plus synchronization" is preferable
from the point of view of implementation complexity and performance in the
unexceptional case, or indeed whether either can be incorporated in
a high-performance superscalar, out-of-order, or speculative implementation
at reasonable cost.

*******************************************************************************

MORE ON DYNAMIC MODES

As I've indicated previously, I have doubts about the dynamic mode bits
of IEEE arithmetic.   I don't think there's any reasonable alternative to
the trap enable/disable bits, but I think rounding modes can be done without.
The primary uses of rounding modes are

1) in interval arithmetic.

2) for sensitivity testing.

I think the proper solution for interval arithmetic is to incorporate the
rounding mode in the opcode, which implies a language model in which rounding
modes are determined statically at compile time.  I think a syntax which
attaches the rounding mode to specific operations in specific expressions
is all that's required.  That entails changes to language standards, but
that is the lesser of several evils.   Lots of NCEG work would be simpler
if rounding modes were static.
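
For example, interval addition needs only two statically known rounding
directions, one per endpoint; in the bracketed per-operation notation proposed
below (with FE_DOWNWARD by analogy with FE_UPWARD) it might read:

	lo = a.lo + [direction FE_DOWNWARD] b.lo;   /* round toward -infinity */
	hi = a.hi + [direction FE_UPWARD]   b.hi;   /* round toward +infinity */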

Sensitivity testing by altering rounding modes at run time is fun and 
convenient but perhaps not worth the cost.  The Cray idea of having 
compile-time code generation options to zero the last N bits of each 
floating-point result is probably good enough.
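
A sketch of the effect, assuming IEEE double format and a 64-bit unsigned
integer type (the interesting part, of course, is having the compiler apply
it to every generated result):

	#include <string.h>

	double chop(double x, int n)        /* 0 <= n <= 52 */
	{
	    unsigned long long u;
	    memcpy(&u, &x, sizeof u);       /* view the bit pattern          */
	    u &= ~((1ULL << n) - 1);        /* clear the low n fraction bits */
	    memcpy(&x, &u, sizeof u);
	    return x;
	}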

*******************************************************************************

MORE on EXCEPTION TESTING

For situations where
the overhead of restartable traps and SIGFPE handling is too much to
bear, and you don't want to be constantly clearing and testing the status
word either, I have previously proposed that conditional branches on specific
exceptions generated by the last FP op might be a preferable alternative.
Corresponding to a hardware instruction (which happens to act much like
a synchronization instruction!)

	fbnvc	address

which conditionally branches to address if the last FP op set the nvc 
exception flag,  there is a language construct that attaches a local GOTO
label to a specific operation in a specific expression.  If the exception occurs
on that operation, the GOTO is taken.   GOTOs are not pretty but they
correspond exactly to what the hardware can easily do.
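
For contrast, the clear-and-test idiom that the branch replaces would look
something like this, using feclearexcept and fetestexcept as placeholder
names for whatever the flag-access functions end up being called:

	feclearexcept(FE_INVALID);
	z = x * y;
	if (fetestexcept(FE_INVALID))    /* did that multiply raise invalid? */
	    goto label1;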

Again the hardware question is whether such conditional branches can be
implemented in a way that does not impose a performance penalty on programs
that don't use them.

I proposed a notation like 

	z = x * [on FE_INVALID label1] [direction FE_UPWARD] y

which converts a normal multiplication z=x*y into one that is performed in 
a directed rounding mode FE_UPWARD and, if an invalid exception occurs,
branches to label1.

There are several points to note about this kind of syntax:

1) You seldom want to evaluate whole complicated expressions or subprograms in 
a particular rounding mode.  You really need the ability to target particular
operations.

2) You seldom want to test for one specific exception on any of a number of
operations in a complicated scalar expression (array expressions are 
different).  There's usually one specific operation that can raise the
critical exception.

In this instance pragmoid directives that can apply to arbitrary expressions
just make more work for the compiler (different code has to be generated)
without buying much.  But for the sake of uniformity of syntax, we might
consider them.  Stallman has proposed what I would call pragmoid directives -
better than pragmas because they can be freely used with macros - that
could be used as a model for the foregoing:

	z = (__oninvalid__ label1) (__direction__ FE_UPWARD) (x*y)

__oninvalid__ and __direction__ are new keywords.  It's certainly more C-like
in syntax, although the generalization to Fortran is not necessarily 
an improvement.

In order to use this style of directive when
the statement is z = x + y + z and you want to apply the 
special directives to the first + but not the second, you would have to break
up the statement into two parts, perhaps with a loss of performance in some
circumstances.
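
That is, something like the following, again with __oninvalid__ and
__direction__ as proposed rather than existing keywords:

	t = (__oninvalid__ label1) (__direction__ FE_UPWARD) (x + y);
	z = t + z;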


