Floating Point Exceptions - Features that didn't get into SPARC

David Hough uunet!Eng.Sun.COM!dgh
Mon Oct 29 13:45:29 PST 1990


The RIOS trapping decisions brought to mind some old discussions.
The following memo dated 6 January 1985 illustrates, if nothing else,
the progress that I've made in understanding computer arithmetic.
In addition it points out another method for dealing with floating-point
traps.

The context was that the SPARC architecture was still somewhat fluid
and the name SPARC hadn't even been appropriated yet.  The Sun 2/50
and SunOS 2.0 had recently been announced and started shipping.

I guess I would classify the following as an exercise for the student:
determine if any of the ideas here are worth preserving.

*****************************************************************************

     Any floating point architecture intended to be an industry standard will
be judged according to two pairs of conflicting criteria: old programs vs. new
programs, and fast execution vs. reliable execution.  All requirements can be
encompassed in a sufficiently complicated architecture.  Unlike previous
proposals, this one derives from first principles without considering a
specific implementation.

OLD PROGRAMS VS. NEW

     By old programs I mean those written in Fortran that run on machines with
a variety of floating point implementations.  These programs have by necessity
evolved to obtain a degree of immunity to floating point details.  These pro-
grams can not take advantage of IEEE arithmetic or of extended precision
expression evaluation without losing their portability.  I describe new pro-
grams, in contrast, as written for standard IEEE arithmetic and extended pre-
cision and possibly dependent on nonstandard language extensions.  They are
portable to other IEEE systems in the more limited sense that it may be
necessary to rewrite the code that uses the nonstandard language extensions.

     Some arithmetic units are designed for old programs: Weitek, National,
Sky.  They don't contain any features that aren't in standard Fortran.  Other
units are designed for new programs: Intel, Motorola, Zilog.  They are satis-
factory for old programs but work even better for new ones.  A program that is
intended to be bulletproof is usually faster if it can exploit extended, even
if an individual extended operation is somewhat slower than an individual dou-
ble precision operation.

     Note that "extended precision" and "IEEE arithmetic" are orthogonal
features; neither depends on the other.

FAST EXECUTION VS. RELIABLE EXECUTION

     There will always be a demand for arithmetic that executes as fast as
possible.  This demand is most serious at supercomputer centers that have
large staffs of numerical experts on hand to figure out how to get reliable
results from sloppy arithmetic.  Closer to home, applications like signal pro-
cessing and graphics manipulation often benefit more from very fast floating
point than from very accurate floating point.

     Everyone wants reliable, predictable, and sequential floating point while
debugging programs, even if they're willing to trade off some of that for
speed later.

GOALS FOR YET ANOTHER INDUSTRY STANDARD FLOATING POINT ARCHITECTURE

1.   Reliability: by default, as reliable numerically as a 68881.

2.   Speed: when requested, execute inner loops as fast as possible.

3.   New programs: support full IEEE arithmetic with extended precision as
     well as a 68881.

4.   Old programs: pay no architectural penalty for IEEE arithmetic or
     extended precision.

5.   Easy code generation: from higher level languages.

6.   Compact code: do not retard execution with numerous instructions for type
     conversion and dynamic saving, setting, and restoring of floating point
     modes.

NON-GOALS FOR YAFPA

1.   Reduced instruction set ideological purity.

2.   Minimal number of instructions.

3.   Minimal implementation cost.

TECHNIQUES FOR OBTAINING GOALS

1.   Modes for old programs.

2.   Modes for new programs.

3.   Optional pipelined execution.

4.   Instructions for coordination and resynchronization of pipelined execu-
     tion.

DEFERRED TRAPS

     To expedite high performance, IEEE floating point traps are not taken
during pipelined floating point execution.  Instead they are deferred.

     To understand deferred traps, consider the case of sequential processing.
There is a set of exception-occurred status bits and a corresponding set of trap-
enabled mode bits.  When a sequential instruction sets an exception-occurred
bit for which the corresponding trap-enabled bit is set, an IEEE floating point
trap is taken.  The PC points to the offending instruction; the original
operands are intact.

     During pipelined processing, the PC is hopelessly confused, and meaningful
IEEE traps can not be taken.  Instead, a record of traps that should have been
taken is maintained for a later reckoning.  This record is another set of
deferred-trap status bits corresponding to the exception-occurred bits.
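
     The relationship among the three bit sets can be sketched in C.  The
structure, the field names, and the take_ieee_trap entry point are all
assumptions made for the sketch, not part of the architecture:

        /* One bit per IEEE exception: invalid, overflow, underflow,
           division by zero, inexact.  The encoding is hypothetical. */
        typedef struct {
            unsigned occurred;   /* exception-occurred status bits */
            unsigned enabled;    /* trap-enabled mode bits         */
            unsigned deferred;   /* deferred-trap status bits      */
        } fp_status;

        void take_ieee_trap(void);   /* assumed trap entry point */

        /* Conceptually invoked when an operation raises exceptions. */
        void raise_exceptions(fp_status *s, unsigned bits, int pipelined)
        {
            s->occurred |= bits;
            if (bits & s->enabled) {
                if (pipelined)
                    s->deferred |= bits & s->enabled; /* settle at FPSYNCH */
                else
                    take_ieee_trap(); /* PC at the op, operands intact */
            }
        }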

     When an FPSYNCH is executed at the end of a loop or basic block, any
deferred traps take effect, but the effect is much different from a trap dur-
ing a sequential instruction. The possible effects are to execute substitute
code or to cause a system trap that indicates an unrecoverable floating point
trap should have occurred.

     Code that exploits FPSYNCH and re-execution looks like this:
                                | Begin basic block
        PIPELINED EXECUTION...  | Fast pass; traps deferred, not taken.
        FPSYNCH         1$      | To 1$ if no traps were deferred.
        NOP                     | Delay position.
        NON-PIPELINED RE-EXECUTION...
1$:
                                | End basic block


CODE GENERATION MODEL

     Programmers have five independent options for code generation; in each
case the first is the default and debugging option, the second is the high
speed option:

1.   Sequential/Pipelined execution;

2.   Modeful/Modeless execution;

3.   Extended/Fortran expression evaluation;

4.   Very/Less strict IEEE code generation;

5.   Re-execute/System-trap upon discovery of deferred IEEE traps.

COMPUTATIONAL INSTRUCTION FORMAT

     Floating point instructions contain at least 19 bits in the following
format (a sketch of one possible encoding follows the list):

1.   F bit: 1 bit that distinguishes floating point instructions from integer
     instructions.

2.   Pipelined bit: 1 bit that specifies sequential or pipelined execution.

3.   Modeless bit: 1 bit that specifies whether results are affected by the
     contents of the IEEE mode register.  If not, results are computed accord-
     ing to the default rounding and no trapping modes.

4.   Op code: 4 or more bits that specify an operation chosen from a set
     including [CLASSIFY, CMP, CMPU, MOVE, SQRT, AINT, +, -, *, /, PREM,
     SCALE].

5.   Rs1, Rs2, Rd: 3n bits that specify source and destination registers from
     among 2^n floating point registers.

6.   Ts1, Ts2, Td: 6 bits that specify the types of the sources and the
     destination.  Each two-bit field specifies a type from among [integer,
     single, double, extended].
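
     As a concrete illustration, the format can be packed into a word as in
the C sketch below.  The bit positions and the choice n = 2 (four registers,
hence 13 + 3n = 19 bits) are assumptions; the memo fixes neither:

        #include <stdint.h>

        enum { TY_INT, TY_SINGLE, TY_DOUBLE, TY_EXTENDED }; /* 2-bit types */

        /* Pack one floating point instruction, assuming n = 2, so the
           F bit lands in bit 18 and the format occupies 19 bits. */
        uint32_t encode_fp(unsigned pipelined, unsigned modeless,
                           unsigned opcode, unsigned rs1, unsigned rs2,
                           unsigned rd, unsigned ts1, unsigned ts2,
                           unsigned td)
        {
            uint32_t w = 1u << 18;        /* F bit: floating point op     */
            w |= (pipelined & 1u) << 17;  /* sequential (0)/pipelined (1) */
            w |= (modeless  & 1u) << 16;  /* ignore IEEE mode register?   */
            w |= (opcode & 0xFu) << 12;   /* 4-bit op code                */
            w |= (rs1 & 3u) << 10;        /* registers: 2 bits each       */
            w |= (rs2 & 3u) << 8;
            w |= (rd  & 3u) << 6;
            w |= (ts1 & 3u) << 4;         /* types: 2 bits each           */
            w |= (ts2 & 3u) << 2;
            w |= (td  & 3u);
            return w;
        }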

DISCUSSION OF COMPUTATIONAL INSTRUCTION FORMAT

     Pipelined bit: If the pipelined bit is off then that instruction does not
execute until all current floating point and integer instructions have com-
pleted, and the integer unit does not execute any new instructions until that
floating point instruction completes.  If the pipelined bit is on then that
instruction executes as soon as an appropriate unit is available and the
integer unit is not affected.  This allows the compiler, and potentially the
source language code, to control pipelining if it is available in the
hardware.  An implementation might well have a dynamic pipelined bit in the
floating point status that would be ANDed with the instruction's pipelined bit
to determine whether pipelining is allowed.

     Modeless bit: The purpose of this bit is not to allow floating point
arithmetic instructions to execute faster, where the effect would be minor,
but rather to avoid repeated saving and restoring of the floating point mode
register in situations where it is subject to change and must be preserved.
For instance, elementary functions can be written with modeless instructions
except for the last assignment, thereby avoiding spurious exceptions without
the penalty of saving and restoring the modes.

     Modeless floating point has one pragmatic quirk: when the type of the
result is integer, rounding is toward zero rather than to nearest.  This sim-
plifies code generation for statements like
        i  = x / y
in Fortran and C, or
        i := trunc(x/y) ;
in Pascal.
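
     A small C program illustrates the language semantics that this quirk is
designed to match; the values are arbitrary:

        #include <stdio.h>

        int main(void)
        {
            double x = 7.0, y = 2.0;
            int i = (int)(x / y);   /* conversion truncates toward zero */
            printf("%d\n", i);      /* prints 3, not the nearest int 4  */
            return 0;
        }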

     Floating point types: integers and singles require one word, doubles two,
extendeds three.  Some possible combinations of operands, operations, and
results are not interesting, such as using floating point operations to add
two integers.

     Floating point registers: Four extended precision registers are satisfac-
tory for almost all floating point code generation, but twenty single preci-
sion registers have been requested for three dimensional graphics transform
applications.

     Floating point operations: CLASSIFY puts the class of its operand in the
floating point status register; Rs2, Ts2, Rd, and Td are ignored. CMP and CMPU
put the result of comparison in the floating point status register; Rd and Td
are ignored; CMPU signals the invalid exception on unordered; CMP does not.
MOVE, SQRT, and AINT ignore Rs2 and Ts2.  MOVE is used to convert between dif-
ferent formats.  AINT rounds toward zero when modeless.  PREM is used to com-
pute a partial remainder; it sets a bit in the floating point status register
if the remainder is incomplete; it also sets a field in the floating point
status register to indicate information about the quotient bits.  SCALE
ignores Ts2; the second argument is always an integer.
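
     For comparison, C99 provides close analogues of these semantics:
fpclassify plays the role of CLASSIFY, the quiet comparison macros behave
like CMP, and the relational operators behave like CMPU.  The sketch below
is illustrative only:

        #include <math.h>
        #include <stdio.h>

        int main(void)
        {
            double x = nan("");   /* a quiet NaN */

            /* CLASSIFY analogue: report the class of the operand. */
            printf("%d\n", fpclassify(x) == FP_NAN);   /* 1 */

            /* CMP analogue: quiet comparison, no invalid on unordered. */
            printf("%d\n", isunordered(x, 1.0));       /* 1 */

            /* CMPU analogue: the < operator is a signaling comparison
               and may raise invalid on a NaN operand. */
            printf("%d\n", x < 1.0);                   /* 0 */
            return 0;
        }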

COORDINATION INSTRUCTIONS

     The coordination instructions are never executed pipelined since they
affect all floating point units at once.  Thus they do not begin execution
until all floating point units are idle.  The format of these instructions is
not specified here.

1.   CONTEXT SWAP COMPLETE FLOATING POINT STATE.

2.   READ/WRITE floating point MODES/STATUS: moves modes or status or both
     between floating registers and integer registers.

3.   FPSYNCH is a conditional branch that checks whether any IEEE floating
     point traps have been deferred.  If NO floating point trap was deferred,
     the conditional branch is taken.  Otherwise the next sequential instruc-
     tion is executed.  That instruction may begin re-execution code if
     appropriate.  The floating point status register contains bits for each
     deferred trap; those bits are cleared by an FPSYNCH.  Thus deferred IEEE
     traps are never actually taken as traps.  The re-execution may take the
     traps if it regenerates them.  The traps may disappear if the initial
     computation was not in extended.

     A null FPSYNCH is one that branches to the next sequential instruction
(the one following the delay position) because no re-execution code is pro-
vided.  If a null FPSYNCH would not be taken, that is, floating point traps
should have occurred but no re-execution code is provided, then a system trap
(not an IEEE trap) is taken for "Unrecovered IEEE trap in pipelined mode".
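
     In C, FPSYNCH behaves roughly as sketched below; fp_status is the
hypothetical model from the deferred-trap sketch above, and system_trap is
an assumed entry point:

        void system_trap(const char *why);   /* assumed entry point */

        /* Returns 1 if the branch is taken (no deferred traps).  The
           hardware also waits for all floating point units to go idle
           before this check, as required of coordination instructions. */
        int fpsynch(fp_status *s, int reexecution_code_provided)
        {
            unsigned pending = s->deferred;
            s->deferred = 0;                 /* FPSYNCH clears the bits  */
            if (pending == 0)
                return 1;                    /* branch past re-execution */
            if (!reexecution_code_provided)  /* the "null FPSYNCH" case  */
                system_trap("Unrecovered IEEE trap in pipelined mode");
            return 0;                        /* fall into re-execution   */
        }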

IMPLEMENTATION ISSUES

     HARDWARE: An implementation could be built from a register file, a con-
troller, and a 68881.  It would not be particularly fast.  To get the intended
performance add Weitek ALUs and Weitek multipliers.  Note that the 68881
built-in registers could be used as the register file or the register file
could be external, depending on how many floating point registers are required
for the architecture.

     Looking at the modeless bit (1 bit) and the type fields (6 bits) as
extensions of the 4-bit op code gives 11 bits in all, for a total of 2^11 =
2048 possible floating point instructions; not all are meaningful.  Some of
the meaningful ones might not be supported by hardware; those can trap to
software.

     CODE GENERATION DETAILS: The floating point instruction to be generated
is determined by the types of the operands and the type of the destination,
which may be an anonymous temporary.  The default mode of code generation is
to generate MODEFUL, SEQUENTIAL, EXTENDED-TEMPORARY instructions.  But five
compiler switches instruct the code generator what to do when other results
are desired:

1.  Pipelined switch:

1.1.  Generate only sequential instructions.

     This is the default.

1.2.  Generate pipelined instructions in basic inner loops.

     This is the recommended optimization.

1.3.  Generate pipelined instructions in basic inner loops and in all other
basic blocks.

     This is for benchmarks and people who write 10000 line basic blocks.

2.  Mode switch:

2.1.  Modeful execution.

     This is the default.  Code observes the IEEE rounding modes and trapping
except where languages specify round toward zero for integer-related opera-
tions.

2.2.  Modeless execution.

     Code always rounds to nearest except that certain operations always
round toward zero.  IEEE exceptions are recorded, but IEEE traps are neither
taken nor deferred.

3.  Expression evaluation switch:

3.1.  The type of anonymous temporaries is extended.

     This is the default.  Re-execution code is always sequential and extended
based.

3.2.  The type of anonymous temporaries is determined by "traditional Fortran
rules".

     Traditional Fortran rules specify that the type of the result is
determined solely by the types of the operands, so that single op single
produces a single result, etc.  This represents a speed optimization in many
implementations.  Operations involving an explicit extended operand get
performed in extended.
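
     The difference is visible in a C analogue, with long double standing in
for extended; the value is chosen so that the single precision temporary
overflows.  (On hardware where C already evaluates float expressions in a
wider format, both lines print 1e+30, which illustrates the same benefit.)

        #include <stdio.h>

        int main(void)
        {
            float a = 1e30f;

            /* Traditional rules: single op single yields a single
               temporary, so a*a overflows to infinity before the divide. */
            float trad = a * a / a;

            /* Extended temporaries: the product is held in a wider
               format, so no intermediate overflow occurs. */
            float ext = (float)((long double)a * a / a);

            printf("traditional: %g  extended: %g\n", trad, ext);
            return 0;
        }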

4.  Strictness switch:

4.1.  Generate mixed precision instructions strictly according to the IEEE
standard.

     The standard forbids instructions, like single := single op double, in
which the precision of the result is less than that of the most precise
operand.
The purpose of the restriction is to maintain uniformity of rounding and to
avoid problems interpreting the wrapped exponent of the result when trapping
on overflow or underflow is enabled.

4.2.  Generate forbidden instructions if convenient.

     The operation double := extended op double generates two instructions in
the default mode, one computing an extended temporary result and one convert-
ing that to the double format of the destination.  In the optional mode, only
one instruction is necessary.

5.  Re-execution switch:

5.1.  Generate FPSYNCH after pipelined blocks followed by re-execution code if
possible.

     This is the default but it has no effect unless pipelined is requested.
Re-execution code is always sequential and extended based.

5.2.  Do not generate FPSYNCH or re-execution code.

     This is suitable for old programs.

     DEFINITIONS: A basic block is a maximal sequence of source language
instructions that has one entry point and one (explicit) exit and contains no
loops or branches or external calls except to externals whose semantics are
known to the code generator, such as abs, sqrt, sin.  A basic inner loop is a
loop containing exactly one basic block and nothing else.

     Floating point programs contain a small proportion of code in such loops
but those loops absorb a majority of execution time.

     Thus to optimize an old Fortran program, specify that all basic inner
loops be pipelined, that IEEE modes not be observed, that expressions be
evaluated by Fortran tradition, and that re-execution not be attempted.  Note
that old programs do not enable any IEEE floating point traps, and never fall
through an FPSYNCH.  Thus re-execution never happens.  Execution will be about
as fast as possible and about as accurate as any old machine.

     To debug an old Fortran program which was not as reliable as you thought,
observe IEEE modes, specify re-execution, and enable trapping on the serious
IEEE exceptions, such as invalid and overflow.  Then the program will run at
the same speed as before until a problem arises.  If the problem arises in
code that is re-executable, the re-execution following the FPSYNCH will either
fix the trap by avoiding the error, thanks to extended, or provide an orderly
pointer to the offending instruction, thanks to sequential execution.

     A basic block is always re-executable.  But besides the re-execution
code, the compiler has to generate extra code to copy the initial values of
all the variables in which stores will occur.  So you probably would not want
to specify pipelined on all basic blocks along with re-execution.

     The compiler has to determine whether a basic inner loop is re-
executable. The main idea is that if a loop only stores into scalars, the ini-
tial values of these scalars are copied to temporaries prior to entering the
loop, recopied again prior to re-executing the loop, and stored only after
bypassing or completing re-execution.  If a loop is not re-executable, then
the compiler generates a null FPSYNCH instruction so that any deferred IEEE
traps lead to a system trap.

     The common re-executable loops only store into one or two variables,
which may be kept in registers, so copying initial values is not much of an
issue.  As an example, the source code
        for i := 1 to n do s := s + x[i]*y[i] ;
could generate this kind of object code:
        move s,ts                       | ts is a register variable
        i := 1 ;
1$:     ts := ts + x[i]*y[i]            | pipelined op codes
        i := i + 1 ; test ; goto 1$

        fpsynch 2$                      | to 2$ if no deferred traps

        move s,ts                       | re-execution: reload original s
        i := 1 ;
3$:     ts := ts + x[i]*y[i]            | sequential op codes
        i := i + 1 ; test ; goto 3$

2$:     move ts,s                       | store result

     Examples of common re-executable loops include polynomial evaluation,
scalar product evaluation (linpack sdot), and continued fraction evaluation.
Common loops that are not re-executable include all operations with vector
results x and y, such as computing x := x + k * y (linpack saxpy) and x := c*x
+ s*y (linpack srot).
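
     The same pattern can be approximated today with the C99 fenv.h
exception flags, which act as a software stand-in for the deferred-trap
bits: run the fast loop with flags recorded, test them at the FPSYNCH point,
and re-execute in a wider format only if something went wrong.  This is an
analogue of the scheme, not the scheme itself; dot is a hypothetical
sdot-style function:

        #include <fenv.h>

        #pragma STDC FENV_ACCESS ON

        double dot(const float *x, const float *y, int n)
        {
            feclearexcept(FE_ALL_EXCEPT);   /* clear "occurred" bits */

            float s = 0.0f;                 /* fast single precision pass */
            for (int i = 0; i < n; i++)
                s += x[i] * y[i];

            /* The FPSYNCH point: branch past re-execution if clean. */
            if (!fetestexcept(FE_OVERFLOW | FE_INVALID))
                return s;

            long double t = 0.0L;           /* sequential extended re-run */
            for (int i = 0; i < n; i++)
                t += (long double)x[i] * y[i];
            return (double)t;
        }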

REDUCING FLOATING POINT ARCHITECTURE

     The number of floating point instructions can be reduced to one quarter
of the number indicated above by replacing Ts1 and Ts2 with a single field
Ts, saving two of the six type bits, so that both operands have the same
type.  The effect of this is that many simple instructions would become two
instructions, the first being a conversion of the less precise operand to
the higher precision, and the second doing the operation.  This is expensive
because of the overhead of two instruction executions and, more importantly,
because the operands can not be left in an unpacked
form between the two instructions as they would be if the two were combined
into one.

     Similarly the operation can not be divorced from the destination format
without impacting performance; all operations would have to compute a full
extended result.


