Highlights from discussions with W. Kahan on 28 June 1991

David Hough sun!Eng!David.Hough
Mon Jul 1 17:53:35 PDT 1991


Kahan visited to give a presentation about exception handling; we then
met with Jim Thomas to talk about NCEG issues and more exception handling.
The following are some of the points I especially noted.  In the course of a
12-hour discussion there was a good deal more, but much of it is part of 
his standard exception handling talk.

Complex multiply: Kahan thinks this should be a hardware instruction, because
it's so hard to get the critical cases for conformal mapping correct and
fast in software.  In particular, infinite operands lead to gratuitous
NaN results from the textbook formula, while zero results often have the
wrong signs, which causes problems in conformal mapping involving slits
in function definition domains.
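
For illustration (a sketch of mine in modern C, not Kahan's code), here is
how the textbook formula (a+bi)(c+di) = (ac-bd) + (ad+bc)i goes wrong for
an infinite operand: mathematically (inf + 0i)*(2 + 0i) should be infinite,
but the imaginary part computes inf*0, an invalid operation yielding NaN:

	#include <math.h>
	#include <stdio.h>

	int main(void) {
	    double a = INFINITY, b = 0.0, c = 2.0, d = 0.0;
	    double re = a*c - b*d;          /* inf*2 - 0*0 = inf */
	    double im = a*d + b*c;          /* inf*0 + 0*2 = NaN + 0 = NaN */
	    printf("(%g, %g)\n", re, im);   /* prints (inf, nan) */
	    return 0;
	}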

Max/min: 
Kahan didn't say this needed to be in hardware but he did say that
max(x,y) should be identical to max(y,x) for all x and y including NaNs,
and this is a tall order to do quickly.
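
To see why, consider this sketch of mine in modern C (not Kahan's code):
the obvious comparison is not commutative once NaNs or oppositely signed
zeros appear, and the extra tests that restore symmetry are what cost time:

	#include <math.h>

	double naive_max(double x, double y) {
	    return x > y ? x : y;   /* naive_max(NAN,1) is 1, naive_max(1,NAN) is NaN */
	}

	/* one possible policy making max(x,y) == max(y,x) for all x and y:
	   any NaN argument yields NaN, and +0 is ordered above -0 */
	double sym_max(double x, double y) {
	    if (isnan(x) || isnan(y)) return x + y;   /* NaN either way */
	    if (x == y) return signbit(x) ? y : x;    /* picks +0 over -0 */
	    return x > y ? x : y;
	}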

Numerical Turing:  
Kahan pointed out that the method of attaching attributes
to specific operators in specific expressions, as I proposed last week,
is in Hull's Turing language.  

Presubstitution:  
Not only is it necessary to have presubstitution value
registers for each kind of floating-point destination - on SPARC-like
systems, single, double, extended/quad if implemented, integer, and
condition code - but the presubstitutable values have to flow through the
pipe with the instructions they modify.  I hadn't noticed this until he
pointed it out, but one of his favorite examples, computing a continued
fraction and its derivative, requires a different precomputed value on each
iteration of the loop, which looks like:

	for j ... {
	d = x + f ; d1 = 1 + f1 ;
	q = b(j)/d ;
	f1 = -(d1/d)*q ; f = a(j) + q ;
	on 0*inf presubstitute b(j-1)*d1/b(j) ; }

So even though no conditional branch is required, a presubstituted
value, itself requiring a division, must be computed on every
iteration.
Presumably the presubstituted value gets loaded during the delay slot of the
branch back to the beginning of the loop.
Compare this with my last proposal, which would look something like

	for j ... {
	d = x + f ; d1 = 1 + f1 ;
	q = b(j)/d ;
	f1 = -(d1/d)* [FE_invalid goto fixup] q ; 
loopend:
	f = a(j) + q ; 
	p = d1 ;
	}
 ...
fixup:  f1= b(j-2)*p/b(j-1) ; goto loopend ;


or in another syntax suggested by Bill Homer,

	for j ... {
	d = x + f ; d1 = 1 + f1 ;
	q = b(j)/d ;
	f1 = SUB_INVALID((-d1/d)*q,b(j-2)*p/b(j-1))  ; 
	f = a(j) + q ; 
	p = d1 ;
	}

SUB_INVALID is a macro which substitutes the second expression if
the indicated exception occurs in the last operator of the first.  I don't
know whether this simplifies optimization compared to the previous version,
in which the spaghetti is more apparent; in any event it's still helpful to
have a goto option as well for more complicated situations.
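
Either notation could be emulated in software today by testing the invalid
flag after the multiply, at the cost of exactly the per-iteration
conditional that presubstitution is meant to avoid.  A rough sketch of mine
in modern C with <fenv.h> follows; the loop bounds and the arrays a and b
are schematic, as in the fragments above:

	#include <fenv.h>

	#pragma STDC FENV_ACCESS ON

	void cf(const double a[], const double b[], double x, int n,
	        double *pf, double *pf1) {
	    double f = 0.0, f1 = 0.0, p = 0.0;
	    for (int j = 2; j <= n; j++) {
	        double d  = x + f;
	        double d1 = 1.0 + f1;
	        double q  = b[j] / d;
	        feclearexcept(FE_INVALID);
	        f1 = -(d1 / d) * q;
	        if (fetestexcept(FE_INVALID))    /* the 0*inf case */
	            f1 = b[j-2] * p / b[j-1];    /* the substituted value */
	        f = a[j] + q;
	        p = d1;                          /* save d1 for a later fixup */
	    }
	    *pf = f; *pf1 = f1;
	}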

To determine the benefit, you have to compare the disruption caused by a
conditional branch on every iteration against the disruption caused by
computing the complete substituted value on every iteration; then weigh
the cost of providing conditional branches on the IEEE current exception
bits against the cost of providing presubstitution registers.
I had been thinking of presubstitution in terms of simpler cases in which the
presubstituted value was determined outside the inner loop, and even then
I was doubtful.

Counting Mode:
I argued that Counting Mode was not worth putting in hardware because the
prototypical computation of a product such as a determinant by

	d = 1
	do 1 i = 1,n
 1	d = d * a(i,i)

could be made immune to overflow and underflow by testing for them after
the loop and doing a scaled computation with frexp if they had
arisen.  That's attractive because although gratuitous intermediate overflow
or underflow is something that linear equation solvers must guard against,
it happens rarely enough that a recomputation from scratch in those cases
is fine.
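
The test-and-recompute scheme might look like the following sketch in
modern C with <fenv.h> (my illustration, not code from the discussion;
the diagonal entries are passed here as a vector a of length n):

	#include <fenv.h>
	#include <math.h>

	#pragma STDC FENV_ACCESS ON

	double det_diag(const double a[], int n) {
	    double d = 1.0;
	    feclearexcept(FE_OVERFLOW | FE_UNDERFLOW);
	    for (int i = 0; i < n; i++)
	        d *= a[i];                        /* fast path, no scaling */
	    if (fetestexcept(FE_OVERFLOW | FE_UNDERFLOW)) {
	        int e, ip = 0;                    /* rare path: redo, scaled */
	        d = 1.0;
	        for (int i = 0; i < n; i++) {
	            d *= frexp(a[i], &e);         /* strip each factor's exponent */
	            ip += e;
	            d = frexp(d, &e);             /* renormalize into [0.5, 1) */
	            ip += e;
	        }
	        d = ldexp(d, ip);                 /* spills only if the true result does */
	    }
	    return d;
	}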

He pointed out that physicists compute quantities like the Clebsch-
Gordan coefficients, which are quotients of long products of moderate-sized
numbers, with end results moderate in size, but the products are so
long that gratuitous intermediate overflow and underflow are normal rather
than rare events.  Thus prescribing recomputation with frexp is useless:
if you lacked counting mode, or a wider format with greater exponent range,
and you wanted to be immune from exponent spill, you'd do the computation
the scaled way from the start and pay the cost every time.
So the question becomes: how often do computations arise in which long
products are required, and computing the factors is cheap compared with
computing scaled products via frexp, so that counting mode would yield a
real performance improvement:

	do 1 i = 1,n
	 ... compute i'th factor f ...
	p = p * frexp(f,&if)
	ip += if
 1	continue	

vs.

	do 1 i = 1,n
	 ... compute i'th factor f ...
	p = p * f
 1	continue	

How efficient could frexp be if it were known to the compiler and properly
supported by a hardware instruction?
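
Potentially quite efficient: for normal operands frexp is nothing but
exponent-field manipulation, as this illustrative sketch of mine in modern
C suggests (a real implementation must also handle zeros, subnormals,
infinities, and NaNs, which is where the library overhead goes):

	#include <stdint.h>
	#include <string.h>

	/* assumes x is a normal, nonzero IEEE double */
	double cheap_frexp(double x, int *e) {
	    uint64_t u;
	    memcpy(&u, &x, sizeof u);
	    *e = (int)((u >> 52) & 0x7ff) - 1022;    /* unbias: x = m * 2^*e */
	    u = (u & ~(UINT64_C(0x7ff) << 52)) | (UINT64_C(1022) << 52);
	    memcpy(&x, &u, sizeof u);                /* now 0.5 <= |x| < 1 */
	    return x;
	}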

Hardware support for exception handling:
All of the above examples turn on arguments like the following:

How expensive is it, on the highest performance systems, to provide 
restartable or continuable traps on floating-point exceptions?

Kahan defines restartable to mean that floating-point exceptions other
than inexact are detected early enough in the fp unit to trap the cpu
before the PC advances, so that the PC at the trap points directly to the
offending fpop and the integer unit has not advanced.  Evidently the MIPS
ISA works this way.

I define continuable to mean that floating-point trap handlers receive
enough information to decode the offending instruction, determine the
operands, insert a substitute result in the destination, and continue if
desired.  The SPARC ISA works this way.  A difference from restartable is
that the integer PC will have advanced to the next fpop or further.
SPARC is precise in that the address of each fpop is maintained in the
fpop queue and is available to the trap handler.  Implementors have
complained because this doubles the size of the queue.

I doubt that either of these approaches is consistent with very high
performance systems, although some people think the problems of continuing
from fp traps are no worse than continuing from page faults or recovering
from badly guessed speculative execution.

Regardless of how expensive these facilities would be in carefully crafted
implementations, the current state of the computer industry is that 
being three months late is like being 20% slower, so any feature that
drastically increases implementation risk is likely to be resisted.

But if these facilities really aren't that difficult to implement, then many
very nice user-mode exception handling facilities can be built up from traps,
including counting mode, presubstitution, and retrospective
diagnostic messages much more informative than Sun Fortran 1.4's:

 Note: the following IEEE floating-point arithmetic exceptions
 occurred and were never cleared; see ieee_flags(3M):
 Inexact;  Underflow;
 Sun's implementation of IEEE arithmetic is discussed in
 the Numerical Computation Guide.
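
A user-mode approximation of such a retrospective diagnostic can be built
from the flag interface alone; here is a sketch of mine in modern C with
<fenv.h> (not Sun's implementation):

	#include <fenv.h>
	#include <stdio.h>
	#include <stdlib.h>

	static void retrospective(void) {
	    int r = fetestexcept(FE_ALL_EXCEPT);   /* raised and never cleared */
	    if (!r) return;
	    fprintf(stderr, "Note: IEEE exceptions occurred and were never cleared:");
	    if (r & FE_INVALID)   fprintf(stderr, " invalid;");
	    if (r & FE_DIVBYZERO) fprintf(stderr, " division;");
	    if (r & FE_OVERFLOW)  fprintf(stderr, " overflow;");
	    if (r & FE_UNDERFLOW) fprintf(stderr, " underflow;");
	    if (r & FE_INEXACT)   fprintf(stderr, " inexact;");
	    fprintf(stderr, "\n");
	}

	int main(void) {
	    atexit(retrospective);       /* report at program exit */
	    volatile double x = 1e308;
	    x *= x;                      /* raises overflow and inexact */
	    return 0;
	}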


