IEEE 754 extended precision expression evaluation - we did it wrong

David Hough sun!Eng!dgh
Thu Mar 14 10:54:23 PST 1991


Don't make any assumptions about who the "we" is until you've read the 
following.

IEEE 754, the Intel 8087, the Motorola 68881, WE 32206,
the now-obsolete SPARC V7,
and Apple's SANE all envisioned a certain style of expression evaluation in
which variables were held in registers of higher precision than storage
and expressions were evaluated in that higher precision.  But now, as I
finish the final testing of the final release of Sun's compilers for
the Sun-3 (the final release for the Sun-386 has been available for
almost a year), I am struck again by the mistakes made in attempting to
define and exploit extended precision.

The immediate cause of this was an all-night exercise which culminated in
tracking down a "compiler bug" in the IMSL statistical library function RBEST.
I knew it was a compiler bug because it only showed up at -O3 -f68881, 
and went away with -ffpa or -fsoft, and with 
-O3 -f68881 if induction variable optimization was disabled.  It hadn't
shown up in previous Sun-3 compiler releases, either.

Eventually I discovered that everything I knew about this bug was wrong.
I tracked it down to code that involved statements like

      TWODF = (NDEF+NDEF)*COVWK(LS)
      WK(L,M+5) = TWODF
      IF (WK(M,L+5) .EQ. TWODF) GO TO 270

spread over 552 lines of code.  Expressions involving L and M crop up 
repeatedly in
this code, computed in complicated ways from do loop indices, so
the induction variable connection seemed obvious.  But nothing I could do
seemed to help, until I printed out WK(M,L+5) and TWODF prior to the
comparison to see what was wrong.  Eventually I realized that there was
nothing wrong with the induction variables; the problem was that TWODF had
been allocated to an extended-precision register on the 68881.  It was
selected for register allocation because it was a scalar (array elements aren't
eligible currently) and because it was referred to frequently.

Unfortunately, the logic of the code depended upon being able to identify
elements of WK by comparing them to TWODF; if no array element matches
TWODF, there will be a problem.  What happened was that WK(L,M+5) contained
the value of TWODF rounded to storage precision, and that wasn't the same
value as the one TWODF held in an extended-precision register.
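
The failure doesn't need 552 lines of IMSL code to reproduce.  Here is a
minimal C sketch (hypothetical code of my own, not taken from RBEST) that
can go wrong the same way whenever the compiler keeps twodf in an
extended-precision register:

	#include <stdio.h>

	int main(void)
	{
	    double wk[1];
	    volatile double n = 3.0;    /* keep the divide out of compile time */
	    double twodf = 1.0 / n;     /* on a 68881 or 80x87 the quotient is
	                                   computed in extended precision */

	    wk[0] = twodf;              /* stored copy is rounded to double */

	    if (wk[0] == twodf)         /* may be false if twodf stays in an
	                                   extended-precision register */
	        printf("matched\n");
	    else
	        printf("no array element matches twodf\n");
	    return 0;
	}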

There are ways of working around this problem by, for instance, calling a
dummy external with TWODF as an argument; that will force TWODF to be rounded
to storage precision in Fortran (but not in C if by-value parameters are
passed in extended precision).  This workaround won't work forever, since
compilers are getting smarter about inlining even separately-compiled
procedures and optimizing them at link time.
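
In C the analogous trick is to pass the variable's address rather than its
value, since an address-taken variable must be spilled to memory at its
declared precision around the call.  A sketch; force_store is a name I made
up, and in practice it has to live in its own compilation unit so it can't
be inlined:

	/* Keep this in a separately compiled file: */
	void force_store(double *p)
	{
	    (void)*p;                   /* deliberately does nothing */
	}

	/* Because its address escapes, twodf must be stored to a 64-bit
	   memory slot before the call, discarding any extended-precision
	   bits, and reloaded afterward in case force_store changed it. */
	double round_to_storage(double twodf)
	{
	    force_store(&twodf);
	    return twodf;
	}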

So is this a compiler bug or a bug in the IMSL source code?  It's hard to
imagine ever debugging any large body of code that was not written with
extended-precision register allocation in mind to work robustly with such
register allocation.
All sorts of high-quality mathematical software intended to
(for instance) determine roundoff level can get confused by extended-precision
register allocation.  Furthermore, debugging a program to work with one
particular release of a compiler is not enough; the next release may allocate
different variables and turn up new problems.

I think it's a bug in the IEEE 754 spec and the 8087 design, which were
rather closely interrelated for a while.
The spec should have been clearer about what was intended; that would have
been problematic, though, since the difficulties that subsequently arose
had not occurred in advance to many of us working on it.

The spec should have required that higher-precision expression evaluation
must be accompanied by constraints to round named variables to storage 
precision.  Thus in

	X = A * B + C

X must be rounded to its declared storage precision even if A * B + C
is computed in extended precision (a good thing, in general) and even if
X is actually allocated to an extended-precision register.  This creates
a problem of generating efficient code: the obvious way to do this is to
actually round and store X to memory, then reload the register; but if you're
going to do that, why allocate X to a register in the first place?
Extended rounding precision doesn't help either; if you intend to run most
programs with double-precision rounding then extended-precision registers in
the hardware don't buy you much.
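
At the source level the intended rule looks roughly like the following
sketch in C, assuming long double maps to the extended format as it does
with the 68881:

	/* What X = A * B + C should mean under the proposed rule: the
	   anonymous temporary stays extended, while the named variable X
	   is rounded to its declared storage precision on assignment. */
	double eval(double A, double B, double C)
	{
	    long double t = (long double)A * B;  /* temporary kept extended */
	    double X = (double)(t + C);          /* named variable rounded */
	    return X;
	}

The open question is how to perform that final rounding to double without
the store/load pair.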

So the 8087 implementation should have included instructions that produced
results rounded to storage precision but still resident in extended registers,
and compilers should have exploited these to avoid those gratuitous store/load
pairs.  In the best case, the foregoing code would have produced a result
A * B in extended precision in an extended register, and then, in one operation,
added that extended result to C, rounded it to X's storage precision,
and put the result in X's extended register.  A simultaneous store to X of
the rounded value would make life simpler for debuggers, although not everybody
tries to support debugging highly optimized code.

If you think about the possibilities, an orthogonal set of op codes adds up
to quite a few combinations: with single, double, and extended allowed for
each of two operands and the result, an orthogonal add alone comes in
3 x 3 x 3 = 27 variants.  Confusion may be compounded by an
unfortunate admonition in 754 against operations other than conversions that
produce a result of lesser precision than the operands.

Furthermore, VAX-centric compilers often decompose the foregoing statement into
intermediate language like
	t = A * B
	X = t + C
in which the distinction between user-defined variables X and internal
variables t is lost, so there's no way to get X rounded to storage precision
without also rounding t, eliminating the advantage of extended precision.

Of course, VAX-centric compilers are demanded by VAX-centric customers who've
already written their code, and are happiest if it runs exactly as it did
on their VAX.  They wouldn't want to exploit extended-precision expression
evaluation in a way that wasn't portable to their next machine.  
Traditional C encouraged reliance on double-precision expression evaluation
of single-precision functions, as long as single-precision variables weren't
allocated to double-precision registers.  That was not much of a problem
until architectures like the R6000 came along that favor double-precision
over single-precision arithmetic; yet by that time ANSI C had dropped
traditional C's encouragement of double-precision expression evaluation.
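
The difference shows up in code as simple as this sketch (a hypothetical
function of my own):

	/* Under traditional (K&R) C the float operands are promoted to
	   double, so the sum of products is computed in double and rounded
	   to float only on return; ANSI C permits the whole expression to
	   be evaluated in float. */
	float sumsq(float a, float b)
	{
	    return a * a + b * b;
	}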

In the end, for PC-class systems it didn't matter, because IBM and Intel
neglected to make the 8087 standard equipment, and neglected to develop and
give away good compilers that exploited the 8087 or emulated it properly
and efficiently.  That created a market for other compilers written by
people who didn't understand the issues, so that over time much of the
potential performance and accuracy benefit the 8087 offered was lost.

The issue is more acute for 6888X systems, and now the 68040 and 80486,
since most of these are in workstations and are more likely to be used for
heavy computation.  It really is important for performance to allocate as
many variables as possible to registers, but when that breaks a program
there's a lot of debugging to figure out why, which may not be pleasant
if the system doesn't offer much support for symbolic debugging of optimized
code.

I suspect that compilers for any architecture need explicit command-line
options or inline directives (you'd want to use them in macros, so ANSI-C
pragmas won't do) that specify whether or not to perform higher-precision
expression evaluation, independent of any other code-generation or
optimization options.  Of course, if you are an independent software vendor,
these directives aren't nearly as helpful if they are different for every
hardware platform.
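
One directive that can be wrapped in a macro today is an assignment through
a volatile temporary, which ANSI C obliges the compiler to carry out in
memory at the declared precision.  A sketch (STORE_ROUND is my name for
it), at the cost of exactly the gratuitous store/load pair discussed above:

	/* Round x to its declared storage precision by forcing a trip
	   through memory; unlike a pragma, this can appear in a macro. */
	#define STORE_ROUND(type, x) \
	    do { volatile type t_ = (x); (x) = t_; } while (0)

	/* usage: STORE_ROUND(double, twodf); */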



