bit-for-bit IEEE compliance vs performance

Jonathan Thornburg uunet!theory.physics.ubc.ca!thornbur
Thu Sep 28 08:04:00 PDT 1995


A number of recent numeric-interest messages have discussed how
compilers should handle the 80x86 extended precision stack, and in
particular if/when/how a compiler should round extended precision
values to normal storage precision (typically double).


It seems to me that underlying this discussion is a somewhat more
general question: if/when/how should a compiler sacrifice performance
to achieve bit-for-bit IEEE compliance?

Vaughan Pratt <pratt@cs.Stanford.EDU> gave a nice test problem for
this discussion a month or so back:

|                 double w, x, y, z;
|                 ...
|                 x = y + z;
|                 ...(no further assignment to x)
|                 w = x - y;
| 
| One cannot deduce that the last line uses x rounded to double, since if
| y + z is still in a register (and hence at extended precision) when the
| last line is reached, gcc for the x86 is happy to treat x as synonymous
| with y+z (very naughty), which makes w exactly 0 in this case.  Whether
| y+z *is* still in a register can depend (at compile time) on code
| *following* the last line, with the result that whether w is zero can
| vary as you vary the following code (but not the values in y and z)!
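
To make the dependence concrete, here's a small self-contained C
sketch (the values are chosen for illustration and are not Pratt's;
the point is just that w's value flips depending on whether x gets
rounded to double):

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
	double y = 1.0;
	double z = ldexp(1.0, -53);	// 2^-53 == half an ulp of 1.0

	// "volatile" forces x to be stored to memory, i.e. rounded
	// to double: 1 + 2^-53 rounds (to nearest even) back to 1.0
	volatile double x = y + z;
	double w = x - y;

	// with x rounded to double, w == 0; if the compiler instead
	// kept y+z on the FPU stack at extended precision, x would
	// hold 1 + 2^-53 exactly and w would come out 2^-53
	printf("w = %g\n", w);
	return 0;
	}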

As Michael Meissner <meissner@cygnus.com> pointed out, on 80x86 machines
there's a nontrivial performance penalty for rounding x to double here:

| Speaking as a member of the GCC team, I believe your statement about
| it being possible to fix GCC without *any* inconvenience or unpleasant
| surprises to the users to be wrong.  I can't see any real way of
| fixing the various problems without making the code slower.  In
| particular, one of the problems is GCC spilling registers from the FPU
| stack to memory and losing precision.  It is certainly doable to store
| and restore the numbers in 80 bit format when spilling registers,
| which at least would prevent much of the nondeterminism.  The problem
| is that the extended store/load instructions take more cycles.
| Another way is to set the machine PSW to evaluate everything in
| double.  This takes cycles also, and there is the chance that if you
| leave the bit in 64-bit mode, it will break the OS libraries,
| particularly printf and friends.
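
For reference, the control-word change Meissner describes looks
roughly like the following sketch, assuming glibc's <fpu_control.h>
interface on an 80x86 (other systems spell this differently, and the
library caveats in the quote above still apply):

	#include <fpu_control.h>

	// switch the x87 precision-control field to 53-bit (double)
	// rounding for all subsequent FPU operations; note that the
	// exponent range stays extended, so results near overflow or
	// underflow can still be double-rounded
	static void set_x87_double_precision(void)
	{
	fpu_control_t cw;
	_FPU_GETCW(cw);
	cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
	_FPU_SETCW(cw);
	}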

Right now I don't want to get into the question of whether or not
hardware designs *should* impose a nontrivial performance penalty for
bit-for-bit IEEE compliance; for the nonce I'll just take it as given
that on (e.g.) the 80x86 and Alpha such penalties are present.


It seems to me that the underlying problem is that there are (at least)
two basic sorts of FP code in common use:
(a) Code where FP rounding errors are unimportant, typically because
    they're much smaller than other things, like finite differencing
    truncation errors, inaccurate input data, etc etc.  This sort
    of code probably doesn't care about bit-for-bit IEEE compliance,
    or quite likely even about IEEE compliance at all.  In other words,
    this sort of code would prefer maximum performance consistent
    with "reasonable" results.
(b) Code where FP rounding errors *do* matter, for example the sort
    of algorithms described by Jonathan Shewchuk <jrs@cs.cmu.edu>:
    | I work on algorithms for simulating arbitrary-precision
    | floating-point arithmetic (refinements of algorithms of Doug
    | Priest).  I have some nice fast algorithms with exceedingly
    | tricky correctness proofs, and I don't believe the algorithms
    | will work in the presence of double rounding (though the
    | pathological cases will probably be quite rare).
    This sort of code obviously does care about strict bit-for-bit
    IEEE compliance, even if that means taking a performance hit.
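
To make (b) concrete, here is Knuth's two-sum, the basic building
block of the Priest/Shewchuk style of algorithm (a standard textbook
routine, not code from Shewchuk's message):

	// two-sum: returns s = fl(a+b) and stores in *err the exact
	// roundoff, so that a + b == s + *err exactly.  The correctness
	// proof requires every operation to be correctly rounded to
	// double -- precisely the bit-for-bit property at stake here;
	// extended-precision intermediates or double rounding break it.
	static double two_sum(double a, double b, double *err)
	{
	double s  = a + b;
	double bb = s - a;
	*err = (a - (s - bb)) + (b - bb);
	return s;
	}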

I think these are both legitimate and important types of FP code, and
hence both need {compiler,hardware} support.


So... I guess what I'd like to see is a standard way for the
programmer to tell the compiler which sort of compilation semantics
are desired.  To minimize the performance hit when we intermix
strict-IEEE and non-strict-IEEE code, the choice should be adjustable
on a fine-grained basis, ideally per-statement.  For block-structured
languages (e.g. C) it would be nice if the choice could be statically
nestable.

Because (a)-code probably accounts for a large majority of the FP ops
executed and FP $$$ spent per year, it should probably be the default.
In any case, the computer industry is unlikely to accept a default which
is anything other than "maximum SPECfp", i.e. maximum performance on
don't-care-about-rounding code.

To be useful in practice, the appropriate source-code incantations
and their semantics have to be standardized across all platforms for
a given language, and ideally this should be done for (at least) Fortran
and C.  (Hmm, backwards compatibility might be achievable without too
much trouble in C via reserved header files and #define, but it may
be a serious problem for Fortran.)
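
In C, the reserved-header idea might look something like this (the
header name, macro names, and feature-test macro are all invented
here for illustration):

	// hypothetical <ieee_semantics.h>: on a compiler that
	// implements the (assumed) block keywords, the macros expand
	// to them; on any other compiler they expand to nothing, so
	// existing code still compiles and gets the default semantics
	#ifdef __HAVE_IEEE_SEMANTICS_KEYWORDS__
	#define STRICT_IEEE_SEMANTICS		__strict_IEEE_semantics__
	#define NONSTRICT_IEEE_SEMANTICS	__nonstrict_IEEE_semantics__
	#else
	#define STRICT_IEEE_SEMANTICS		/* default semantics */
	#define NONSTRICT_IEEE_SEMANTICS
	#endif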


An example of the sort of thing I'd like might be:

	__nonstrict_IEEE_semantics__	// modulo static nesting,
					// this block will be compiled
					// for maximum performance, even
					// if this breaks IEEE compliance
	{
	double w, x, y, z;
	int i, N;
	double h, h2_inv, *A, *dxx_A;	// (assume these get set up elsewhere)

	__strict_IEEE_semantics__	// this block will be compiled
					// with strict IEEE semantics,
					// even if this means lower
					// performance
	{
	x = y + z;	// strict IEEE rounding guaranteed here
			// (we'll suppose)
	// ... other code
	// no further assignment to x
	w = x - y;	// x here guaranteed to be strictly IEEE rounded
			// (we'll suppose)
	// ... other code
	}

	// we're outside the strict-IEEE-semantics scope, so
	// the following code again will be compiled for maximum
	// performance
	h2_inv = 1.0 / (h*h);
	for (i = 1 ; i < N-1 ; ++i)	// interior points only, so
	{				// A[i-1] and A[i+1] stay in bounds
	dxx_A[i] = h2_inv * (A[i-1] - 2.0*A[i] + A[i+1]);
	}
	}


Ignoring questions of syntax, and assuming we can pin down just what
the semantics of strict-IEEE code should be (this isn't trivial,
especially given that variable use-definition chains will cross both
ways between strict and non-strict blocks), this sort of scheme would
seem to offer the best of both worlds: bit-for-bit IEEE compliance
when we need it, and maximum performance when we don't.
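
As a concrete instance of the use-definition problem, consider the
following fragment (reusing the hypothetical keywords from the sketch
above):

	double a, b;		// assume a and b get set up elsewhere
	__nonstrict_IEEE_semantics__
	{
	double t = a + b;	// may legally be held at extended precision
	__strict_IEEE_semantics__
	{
	double u = t - a;	// should this see t rounded to double,
				// or the extended-precision a+b ?
	}
	}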

Comments?

In particular, can any of the compiler wizards on this mailing list
comment on the effort involved in implementing something along these
lines in (say) gcc?

- Jonathan Thornburg
  University of British Columbia / Physics Dept / <thornbur@physics.ubc.ca>
  "Washing one's hands of the conflict between the powerful and the powerless
   means to side with the powerful, not to be neutral." - Freire / OXFAM


