[Cfp-interest] third example of alternate exception handling: LAPACK larf

Tue Jun 17 16:49:10 PDT 2014

IEEE 754 Alternate Exception Handling Example:
LAPACK *LARF*

[sdcz]larfg.f is a building block of LAPACK that computes a Householder
transformation (reflection).     [sdcz]larfp.f was an attempt to improve
the output of larfg by guaranteeing that a particular output parameter
was real and nonnegative.    But it often failed due to intermediate 
overflow and underflow, even though the end result of a (unitary)
reflection on normal-size data can't overflow.    The latest 3.5 release
of LAPACK contains [sdcz]larfgp.f which attempts to avoid those gratuitous
intermediate overflows and underflows.

At first glance, alternate exception handling doesn't seem to offer any
help.    The inner loop of clarfgp.f, the complex*8 version,
cleared of all tests for exceptional cases, reduces to

      XNORM = SCNRM2( N-1, X, INCX )
      ALPHR = REAL( ALPHA )
      ALPHI = AIMAG( ALPHA )

         BETA = SIGN( SLAPY3( ALPHR, ALPHI, XNORM ), ALPHR )

         SAVEALPHA = ALPHA
         ALPHA = ALPHA + BETA
         IF( BETA.LT.ZERO ) THEN
            BETA = -BETA
            TAU = -ALPHA / BETA
         ELSE
            ALPHR = ALPHI * (ALPHI/REAL( ALPHA ))
            ALPHR = ALPHR + XNORM * (XNORM/REAL( ALPHA ))
            TAU = CMPLX( ALPHR/BETA, -ALPHI/BETA )
            ALPHA = CMPLX( -ALPHR, ALPHI )
         END IF
         ALPHA = CLADIV( CMPLX( ONE ), ALPHA )

           CALL CSCAL( N-1, ALPHA, X, INCX )

         ALPHA = BETA

This looks pretty simple - there are no explicit loops here.
SCNRM2 and CSCAL are lower-level building block loops
that do pretty simple operations:
computing the root-sum-square norm of a vector, and multiplying a
vector by a constant.

SLAPY3 and CLADIV do even simpler tasks with no loops - 
a root-sum-squares of a three-vector and
a complex division.

However SCNRM2, SLAPY3, and CLADIV go to quite a bit of trouble to avoid
problems with gratuitous intermediate
overflows and underflows - many extra divisions are
inserted that you might not think would be there until you'd considered
scaling to avoid overflows and underflows.     

http://www.netlib.org/lapack/explore-html/db/d66/scnrm2_8f_source.html
http://www.netlib.org/lapack/explore-html/d0/d30/slapy3_8f_source.html
http://www.netlib.org/lapack/explore-html/dc/da7/sladiv_8f_source.html

The performance is a far
cry from the simplest inline code that one could imagine.

But what if you could get the best possible performance in the normal
case and still maintain robust correctness when exceptions arise?

An alternate exception handling approach to this case would look conceptually
something like (mixing C and Fortran syntax - translating the Fortran into
C would obscure more than clarify)

try { 
 normal case code derived from above...
}
catch (FE_INVALID|FE_DIVBYZERO|FE_OVERFLOW|FE_UNDERFLOW) {
 exceptional case code from... 
 http://www.netlib.org/cgi-bin/netlibfiles.txt?format=txt&filename=/lapack/lapack_routine/clarfgp.f
}

What does "normal case code derived from above" mean here?
Just using the normal case code listed above wouldn't help much - the
bulk of the computational time is spent in the external subroutines
mentioned, doing all those scaling divisions whether needed or not.

So to get the performance benefit of alternate exception handling,
it's necessary to expand out the calls to SCNRM2, CLADIV, and SLAPY3
into their simplest semantics with no scaling so they can execute as
fast as possible in the normal case.     And

	ALPHA = CLADIV( CMPLX( ONE ), ALPHA )

could be expanded to ALPHA=ONE/ALPHA or even

	ALPHA=CONJG(ALPHA)/(REAL(ALPHA)**2+AIMAG(ALPHA)**2)

(two real divisions, not one complex division).

In C there is even a #pragma specifically for cases like this:

	#pragma STDC CX_LIMITED_RANGE on

means that complex arithmetic (particularly division and more complicated
operations) can be evaluated quickly rather than safely.     Normally this
amounts to an assertion by the programmer that the data will be "safe",
granting forgiveness to the compiler if the programmer is wrong.
In this case, the programmer makes no such claim, but stands ready to deal
correctly with unsafe data in the catch clause, which takes all 
necessary precautions.

Of course it's not required to undo all the nice structured layering of
LAPACK.    If one could be sure that alternate exception handling propagated
properly when calling separately-compiled functions, 
then it would suffice to have "fast" simple versions of the necessary LAPACK
building blocks for invocation only from within try clauses.

The CSCAL loop source code takes no precautions and so is about as
fast as possible, and SLAPY3 and CLADIV are only invoked once.
So in this program the principal potential 
performance benefit would arise from the loop

  XNORM = SCNRM2( N-1, X, INCX )

which is heavily fortified with divisions in the standard LAPACK source code, 
though there are other possible approaches to making it faster and robust.