avoiding restartable traps and presubstitution

Sat May 4 20:07:21 PDT 1991

Continuable traps allow you to examine the state of the system, including
the trapping instruction and its operands, compute a result, and continue.
Such traps are recommended by IEEE 754.

The cost of continuable traps is very high on highly-pipelined systems,
especially those that produce results out of order.
A lot of hardware mechanism is required to ensure that the state of the
system can be unwound appropriately; this silicon might better be put to
other uses that contribute to performance in the normal case.

For this reason Kahan currently favors "presubstitution": a facility by which
a user-mode program defines in advance the result an exception should deliver
if it arises.  No hardware trap is then needed to hand the operation,
operands, and exception to a user-mode handler that computes a result
different from the IEEE default and then resumes execution.

In the simplest case, imagine that for whatever reason you would prefer to
return the (appropriately signed) largest number in a particular format,
rather than the IEEE 754 default result of infinity, should floating-point
division by zero arise in a certain loop.  You might code it something like

	call presubstitute("division_by_zero", "max_float")
	<loop>
	call presubstitute("division_by_zero", "default")

The presubstitute function would change the hardware state so that 
the appropriate result would be generated instead of infinity if division
by zero were to occur.	
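
To make the intended semantics concrete, here is a rough software sketch in
Python.  It is not a proposal for the hardware interface: the table _presub
and the function divide() are hypothetical stand-ins for the hardware state
and the divide instruction, and only division by zero is modeled.

```python
import math
import sys

# Hypothetical stand-in for the hardware state that presubstitute() would set.
_presub = {"division_by_zero": "default"}

def presubstitute(exception, choice):
    """Select the result delivered when `exception` arises (sketch only)."""
    if exception != "division_by_zero" or choice not in ("default", "max_float"):
        raise ValueError("this sketch only models division_by_zero")
    _presub[exception] = choice

def divide(x, y):
    """x / y, with the division-by-zero result taken from _presub."""
    if y != 0.0:
        return x / y                      # no zero divisor: ordinary division
    if math.isnan(x) or x == 0.0:
        return math.nan                   # NaN/0 propagates; 0/0 is invalid instead
    # The sign is the usual one: sign(x) combined with sign(y), so -0 matters.
    sign = math.copysign(1.0, x) * math.copysign(1.0, y)
    if math.isinf(x) or _presub["division_by_zero"] == "default":
        return sign * math.inf            # inf/0 is exact; otherwise IEEE default
    return sign * sys.float_info.max      # presubstituted magnitude, usual sign

presubstitute("division_by_zero", "max_float")
ratios = [divide(1.0, d) for d in (2.0, 0.0, -0.0)]   # the <loop>
presubstitute("division_by_zero", "default")
```

Inside the bracketed region, 1/0 and 1/-0 deliver the largest finite double
with the appropriate sign; outside it they deliver infinity as usual.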

The underlying hardware implementation might have special registers containing
the intended result.  You can easily imagine this would get fairly complicated;
exceptions like underflow, overflow, and division by zero
would want to use the 
presubstituted magnitude combined with the usual sign of the result, while
exceptions like invalid would want to use the presubstituted sign and 
magnitude.  All these special registers would add to the machine state,
and the data paths to access them would cost something.

I wonder if something much simpler might solve 90% of the problem.  I think
there are about four common situations:

	IEEE default
	SIGFPE to a handler that must abort or longjmp rather than return
	VAXCENTRIC avoidance of IEEE subnormal and infinite results
	WRAP overflowed and underflowed exponents

So these four cases could be encoded in a couple of mode bits per exception
(rather than the one bit typically used).  Alternatively, WRAP in particular,
and to a lesser extent VAXCENTRIC, could be encoded as different op codes.

WRAP--------------------------------------

WRAP mode is for determinants and other situations where a product or quotient
of many factors is needed, and while the end result may well lie in range,
intermediate overflow or underflow is to be expected from time to time.
The requirement for wrapped exponents on trapped overflow and underflow
is embodied in IEEE 754 for these situations.

The easiest way to deal with that problem is to use an intermediate format
with a larger exponent range, but if that's not available then you need to
set up overflow/underflow signal handlers to do the wrapping and keep track
of the number of wraps up and down, and if that's not available then you need
to set up overflow/underflow signal handlers that abort the loop and recompute
it using logs or scaling or other means of avoiding over/underflow.
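
As an illustration of the scaling fallback, the following Python sketch keeps
the running product as a fraction and a separate integer exponent, so no
intermediate result can overflow or underflow.  The function name
scaled_product is my own, and this is only one of several ways to scale.

```python
import math

def scaled_product(xs):
    """Return (m, e) such that the product of xs equals m * 2**e.

    m is kept in [0.5, 1) in magnitude (or is zero), so only the integer
    exponent e grows or shrinks as factors are accumulated.
    """
    m, e = 0.5, 1                      # 0.5 * 2**1 == 1.0, the empty product
    for x in xs:
        mx, ex = math.frexp(x)         # x == mx * 2**ex
        m, de = math.frexp(m * mx)     # renormalize after every factor
        e += ex + de
    return m, e

xs = [1e200, 1e200, 1e-200, 1e-200]
m, e = scaled_product(xs)
result = math.ldexp(m, e)              # fine here because the end result is in range

naive = 1.0
for x in xs:
    naive *= x                         # overflows to infinity at the second factor
```

math.ldexp raises OverflowError if the final product really is out of range,
which is the case the caller has to decide how to report.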

To actually use WRAPped exponents from a higher-level language requires some
language features that don't exist, perhaps because the hardware support is
not often available.  For instance, the Weitek 1164/5 floating-point chips
used in the original Sun-4's can produce wrapped results but the SPARC
architecture has no place to put them; storing a wrapped result in a 
destination register may wipe out an operand that a signal handler would have
liked to have.  

Anyway, where you might conventionally write

	p=x(1)
	do 1 i=2,n
 1	p=p*x(i)	

you might prefer to be able to write something like

	p=x(1)
	k=0
	do 1 i=2,n
 1	p=p *!k! x(i)
	if (p .ne. 0 .and. k .gt. 0) print *," overflow "
	if (p .ne. 0 .and. k .lt. 0) print *," underflow "

relying on a language feature that interpreted *!k! as a multiplication
that wrapped on overflow or underflow, incrementing k in the case of overflow,
and decrementing k in the case of underflow.  This language feature in turn
is based on op codes like

	fwmuld	%f1,%f2,%f3 

that computes floating-point register %f3=%f1*%f2, 
wrapped if necessary; followed by

	fwrap	%i1 

that increments integer register %i1 if the previous fpop caused an overflow,
decrements if an underflow, and does nothing normally.
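
The pair of instructions can be imitated in software.  The Python sketch below
is an emulation under assumptions, not a description of any real SPARC op
code: fwmuld() returns the wrapped product together with +1, -1, or 0, and
adding that value to k plays the role of fwrap.  The exponent adjustment of
1536 is the one IEEE 754 specifies for trapped overflow and underflow in
double precision; treating every result below the normal range as an
underflow to be wrapped is a simplification of mine.

```python
import math

WRAP_BIAS = 1536              # IEEE 754 exponent adjustment for double
E_MAX, E_MIN = 1024, -1021    # frexp exponents spanning the normal doubles

def fwmuld(a, b):
    """Return (a*b wrapped into range, wrap) with wrap in {+1, 0, -1}."""
    if a == 0.0 or b == 0.0 or not (math.isfinite(a) and math.isfinite(b)):
        return a * b, 0                       # zeros, infinities, NaNs: no wrapping
    ma, ea = math.frexp(a)
    mb, eb = math.frexp(b)
    m, de = math.frexp(ma * mb)               # fraction of the product, rounded once
    e = ea + eb + de
    if e > E_MAX:                             # overflow: scale down by 2**1536
        return math.ldexp(m, e - WRAP_BIAS), +1
    if e < E_MIN:                             # underflow: scale up by 2**1536
        return math.ldexp(m, e + WRAP_BIAS), -1
    return math.ldexp(m, e), 0

def wrapped_product(xs):
    """Product of xs as (p, k); the true value is p * 2**(1536*k)."""
    p, k = xs[0], 0
    for x in xs[1:]:
        p, wrap = fwmuld(p, x)
        k += wrap                             # what fwrap would do to %i1
    return p, k

p, k = wrapped_product([1e200, 1e200, 1e-200, 1e-200])
```

For this input the count goes up once and down once, so k ends at zero and p
is the in-range answer, close to 1.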

You can imagine various objections to the foregoing but the hardware is
likely to be easier to implement than continuable traps.

VAXCENTRIC ------------------------------

VAXCENTRIC mode is mostly for pre-IEEE 754 programs and programmers, although
it may offer performance advantages in some cases.  The idea is to avoid
generating any subnormal or infinite results for underflow, overflow, or
division by zero.
Whenever 754 calls for a subnormal result, 
produce instead a zero of appropriate sign.  Whenever 754 calls for an
infinite result, produce instead the largest normal magnitude of appropriate
sign.
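
A small Python sketch of these two rules follows.  The names vaxcentric,
vax_mul, and vax_div are hypothetical, and the sketch leaves operations on
infinite or NaN operands to ordinary IEEE handling.

```python
import math
import sys

MAX_NORMAL = sys.float_info.max    # largest finite double
MIN_NORMAL = sys.float_info.min    # smallest positive normal double

def vaxcentric(result):
    """Map an IEEE default result to its VAXCENTRIC replacement."""
    if math.isinf(result):
        return math.copysign(MAX_NORMAL, result)   # infinity -> largest normal
    if 0.0 < abs(result) < MIN_NORMAL:
        return math.copysign(0.0, result)          # subnormal -> signed zero
    return result

def vax_mul(a, b):
    if math.isinf(a) or math.isinf(b):
        return a * b                               # infinite operand: IEEE handling
    return vaxcentric(a * b)

def vax_div(a, b):
    if b == 0.0:
        if a == 0.0 or math.isnan(a):
            return math.nan                        # 0/0 or NaN operand: IEEE handling
        sign = math.copysign(1.0, a) * math.copysign(1.0, b)
        if math.isinf(a):
            return sign * math.inf                 # infinite operand: IEEE handling
        return sign * MAX_NORMAL                   # division by zero
    if math.isinf(a) or math.isinf(b):
        return a / b
    return vaxcentric(a / b)
```

For example, vax_mul(1e200, 1e200) gives the largest finite double rather
than infinity, and vax_mul(-1e-160, 1e-160) gives a negative zero rather than
a subnormal.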

In this mode I am inclined not to change the IEEE handling of subnormal,
infinite, or NaN operands, nor of invalid exceptions; if these arise in
VAXCENTRIC mode, the program still has to deal with them some other way.

SPARC has a mode bit which enables nonstandard arithmetic; so far it has
always been used to enable a faster hardware mode that flushes subnormal results
to zero and treats subnormal operands as zero.

QUESTION ----------------------------------

Are there other important cases than those above,
for which overflow, underflow, or division by
zero might best be handled some other way by presubstitution?

INVALID ------------------------------------

Neither WRAP nor VAXCENTRIC mode allows customization of the results of
invalid operations, of which 0/0 and sqrt(negative) are perhaps most
important.  One could imagine extending the VAXCENTRIC idea by defining 
different default results for some of the various invalid operations;
these definitions would be appropriate for some but not all problems:
inf - inf -> 0, 0 * inf -> 1, 0/0 -> 1, inf/inf -> 1, x % 0 -> 0,
inf % x -> 0.  

What should be done about sqrt(negative)?  People have at various times
implemented sqrt(negative) -> 0 on the theory that the argument should have
been zero or small positive but went negative due to roundoff, and
sqrt(negative x) -> -sqrt(-x) on the same theory but with the hope that
if the theory were unjustified a negative sqrt would lead to more questions
than a zero sqrt would.
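
The two historical choices are easy to state in code.  This Python sketch uses
names of my own (sqrt_clamped and sqrt_signed) and passes NaN arguments
through unchanged.

```python
import math

def sqrt_clamped(x):
    """sqrt(negative) -> 0: assume the argument went negative through roundoff."""
    if math.isnan(x) or x >= 0.0:
        return math.sqrt(x)
    return 0.0

def sqrt_signed(x):
    """sqrt(negative x) -> -sqrt(-x): a negative root invites questions."""
    if x < 0.0:
        return -math.sqrt(-x)
    return math.sqrt(x)
```

With these, sqrt_clamped(-1e-17) is 0 while sqrt_signed(-4.0) is -2, which is
the more conspicuous result if the roundoff theory turns out to be wrong.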

What should be done about a signaling NaN?
I can think of two probable applications for signaling NaNs:

	1) debugging by initializing memory to 1 bits instead of 0 bits
	2) user-defined IEEE 754 extensions based on continuable traps

The second application has a dim future because of the general problems of
continuable traps.  The first application is valuable and would be more so
if there were convenient ways to appropriately initialize memory.  This 
application however presumably would be used with the intent of aborting
a program that encountered a signaling NaN, so what happens in VAXCENTRIC
mode would not matter.

INHIBITED MODE ----------------------------

Considering all the foregoing, would it be better to attack the problem
of special treatment of invalid exceptions by special language and hardware
that avoids both presubstitution and continuable traps, without
slowing down the normal case as much as with other alternatives?  
Consider conventional code

	do 1 i=1,n
 1	z(i) = x(i) / y(i)

replaced by

	ieee_flags("clear","exception","invalid")
	do 1 i=1,n
 	z(i) = x(i) / y(i)
	if (ieee_flags("get","exception","invalid") .ne. 0) then
		z(i) = 0                                  ... or 1 or whatever
		ieee_flags("clear","exception","invalid")
	end if
 1	continue

which probably requires a subroutine call in the inner loop and at best
requires halting the pipe and examining the accumulated IEEE exceptions.
Now wouldn't you rather do

	ieee_flags("set","exception","invalid","inhibited")
	do 1 i=1,n
	z(i) = x(i) /!!2!! y(i)
 	goto 1
 2	continue
	z(i) = 0
 1	continue	

Some novel operator syntax here directs abnormal execution to label 2.
The operation in progress is not completed (preserving z(i)).
This would be based on an inhibited hardware mode for each exception,
in which exceptions set the current exception bits and inhibit storing
results, and on hardware instructions like

	fbinhibited	addr

which performs a conditional branch to addr if the last fpop generated an
exception for which inhibited handling was specified.  
Thus the branch (like the wrapped handling described
earlier) is based on what are called the current exception (cexc)
bits in SPARC fsr or the exception status bits in 68881 fpsr.  These aren't
part of IEEE 754 but are built into most implementations.  In
pipelined systems one can imagine the cexc bits from each operation 
flowing along with the corresponding result, the way condition codes sometimes
work.  The conditional branch slows down
normal operation but perhaps less than alternatives that provide comparable
flexibility. 
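
The idea of cexc bits flowing along with each result can be imitated in
Python as below.  This is only a model of the control flow: fdiv() returns
the quotient together with the exceptions it raised, and the test on those
bits plays the role of fbinhibited.  The flag values and function names are
illustrative assumptions, and overflow, underflow, and inexact are not
tracked.

```python
import math

INVALID, DIV_BY_ZERO = 0x10, 0x02     # illustrative flag values

def fdiv(x, y):
    """Return (quotient, cexc) for x / y."""
    if math.isnan(x) or math.isnan(y):
        return math.nan, 0                         # quiet NaN operand: no exception
    if y == 0.0:
        if x == 0.0:
            return math.nan, INVALID               # 0/0
        q = math.copysign(math.inf, x) * math.copysign(1.0, y)
        return q, (0 if math.isinf(x) else DIV_BY_ZERO)
    if math.isinf(x) and math.isinf(y):
        return math.nan, INVALID                   # inf/inf
    return x / y, 0

def divide_arrays(x, y, inhibited=INVALID, fallback=0.0):
    """z(i) = x(i) / y(i), taking the 'label 2' path on inhibited exceptions."""
    z = [None] * len(x)
    for i in range(len(x)):
        result, cexc = fdiv(x[i], y[i])
        if cexc & inhibited:       # fbinhibited: the result was never stored
            z[i] = fallback        # label 2 in the loop above
        else:
            z[i] = result
    return z

z = divide_arrays([1.0, 0.0, 6.0], [2.0, 0.0, 3.0])
```

Only the 0/0 element takes the fallback path; the other elements pay for one
test of the bits that arrived with their result.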

Adding up the modes for each exception we get

	IEEE default (all five IEEE 754 exceptions)
	SIGFPE (all five)
	VAXCENTRIC (overflow, underflow, division by zero)
	WRAP (overflow, underflow)
	INHIBITED (all five)

If we were to implement inhibited processing for all five IEEE exceptions,
overflow and underflow would each have five possible modes.  Five modes need
three bits, a bad deal bit-field-wise when four modes fit in two.  So the
temptation is to reserve inhibited as an option for division by zero and
invalid exceptions only, so that each of the five IEEE exceptions needs only
two exception-handling mode bits, encoding:

	IEEE default (all five IEEE 754 exceptions)
	SIGFPE (all five)
	VAXCENTRIC (overflow, underflow, division by zero)
	WRAP (overflow, underflow) + INHIBITED (division by zero, invalid)
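
A Python sketch of how such a ten-bit mode field might be packed and decoded
follows.  The numeric codes, the ordering of the exceptions within the word,
and all the names are assumptions of mine; the only thing taken from the list
above is which modes apply to which exceptions.

```python
EXCEPTIONS = ("invalid", "overflow", "underflow", "division_by_zero", "inexact")
DEFAULT, SIGFPE, VAXCENTRIC, SPECIAL = 0, 1, 2, 3    # assumed 2-bit codes

def mode_name(exception, code):
    """Name the handling a 2-bit code selects, or None if it is unused."""
    if code == DEFAULT:
        return "IEEE default"
    if code == SIGFPE:
        return "SIGFPE"
    if code == VAXCENTRIC:
        ok = exception in ("overflow", "underflow", "division_by_zero")
        return "VAXCENTRIC" if ok else None
    if code == SPECIAL:                 # the shared fourth code
        if exception in ("overflow", "underflow"):
            return "WRAP"
        if exception in ("division_by_zero", "invalid"):
            return "INHIBITED"
    return None

def pack(modes):
    """Pack {exception: code} into one word, two bits per exception."""
    word = 0
    for i, exc in enumerate(EXCEPTIONS):
        code = modes.get(exc, DEFAULT)
        if mode_name(exc, code) is None:
            raise ValueError("mode %r is not defined for %s" % (code, exc))
        word |= code << (2 * i)
    return word

def unpack(word):
    """Decode a packed word into {exception: mode name}."""
    return {exc: mode_name(exc, (word >> (2 * i)) & 3)
            for i, exc in enumerate(EXCEPTIONS)}

word = pack({"overflow": SPECIAL, "invalid": SPECIAL,
             "division_by_zero": VAXCENTRIC})
modes = unpack(word)
```

Here the same code 3 decodes as WRAP for overflow and as INHIBITED for
invalid, which is the sharing that lets every exception fit in two bits.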

QUESTIONS --------------------------

Is this enough to be satisfactory for all the important
cases where one might want to do presubstitution?  Some language extensions
are necessary for efficient implementations, but that's true in any case.

Would you hardware types rather implement continuable traps, or general
presubstitution, or VAXCENTRIC, WRAP, and INHIBITED modes, or something else
that solves the problem?