1991 discussion on 128-bit floating-point arithmetic

David G. Hough at validgh dgh
Wed Jul 19 20:17:43 PDT 1995


To refresh the memory of those who are interested, here are
some of the highlights from last time:

From: validgh!uunet!fig.cray.com!tam@Sun.COM (Tom MacDonald)
Subject: IEEE quad precision

I was recently asked about an attempt to standardize a 128 bit floating
point format through an IEEE committee 1754.  Rumour has it that this
format has already been implemented in the HP Precision Architecture.
Does anyone know what the charter of this committee is, and whether or
not a definition of the 128 bit floating-point format is available?
Thanks in advance for any information.

From: Jerry Huck <validgh!sun!iag.hp.com!huck@Sun.COM>
Subject: Re: IEEE quad precision
Date: Mon, 3 Jun 91 16:59:30 PDT

> I was recently asked about an attempt to standardize a 128 bit floating
> point format through an IEEE committee 1754.  Rumour has it that this
> format has already been implemented in the HP Precision Architecture.
> Does anyone know what the charter of this committee is, and whether or
> not a definition of the 128 bit floating-point format is available?

wrt PA-RISC (HP's most current name for HP Precision Architecture), it
defines a 128-bit "quad" precision format that complies with IEEE/ANSI
754's requirements for a double extended format.  It is the obvious
definition with 1 sign bit, 15 exponent bits, and 112 (128-15-1)
fraction bits.  It also uses a hidden bit.  None of the current
PA-RISC processors implement quad operations.  Languages currently
access quad functionality through procedural interfaces.
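
As a concrete sketch of that layout, assuming the quad value is held as
two 64-bit halves with the most significant half first, the three fields
can be picked apart in C like this:

	#include <stdint.h>
	#include <stdio.h>

	/* Decode the quad fields: 1 sign bit, 15 exponent bits, and 112
	   fraction bits; the leading significand bit is hidden. */
	void decode_quad(uint64_t hi, uint64_t lo)
	{
	    unsigned sign     = (unsigned)(hi >> 63);
	    unsigned exponent = (unsigned)((hi >> 48) & 0x7fff);
	    uint64_t frac_hi  = hi & 0xffffffffffffULL; /* top 48 of 112 fraction bits */

	    printf("sign=%u exponent=0x%04x fraction=0x%012llx%016llx\n",
	           sign, exponent,
	           (unsigned long long)frac_hi, (unsigned long long)lo);
	}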

A look at the latest draft proposal for P1754 shows an
identical "quad" format.

IEEE standard group 1754 is defining a standard open microprocessor
architecture based on the SPARC architecture.  Inquiries to
phil@sparc.com (chairman of the working group).

Jerry Huck
Information Architecture Group
Hewlett-Packard

Date: Tue, 4 Jun 91 11:20:33 PDT
From: validgh!dgh@Sun.COM (David G. Hough on validgh)
Subject: SPARC quad format


The early versions of the SPARC architecture defined the 128-bit format
to have 64 significant bits and 48 unused bits.  In 1987 I proposed
that the quad format be changed to 113 significant bits, and this was
eventually accepted for SPARC V8, possibly because nobody expected to
actually implement it in hardware in their lifetime.

My reasons included:

1) Progress in applying interval arithmetic methods reported at the
9/87 interval arithmetic conference in Columbus.  Conventional interval
methods benefit from lots of extra precision, while Kulisch-style
methods benefit from doubled-precision products.

2) A number of existing Cray customers cited lack of 128-bit format as
a reason for staying with Cray mainframes rather than moving to Unix
workstations.

3) Spectrum, I mean Precision, I mean PA-RISC, defined 128-bit quad and
I figured that HP had probably researched that issue as much as other
aspects of the architecture and concluded it was the appropriate future
direction.

4) Quad can be used as an extended precision for expression evaluation
too, as long as you're careful that storage-precision variables
allocated to extended registers are rounded to storage format whenever
assignments to storage format occur in the program source.

5) On a 64-bit bus, anything larger than 64 bits will require two
accesses, so you might as well use all the bits.  Memory bandwidth is
probably the gating factor on floating-point performance in 
high-performance designs.

6) At least one hardware designer thought that although it would take a
little longer to get a quad precision adder implemented, it would not
take so much longer that it was worthwhile doing a klugey type of quad
precision such as composing one from two doubles.  Such arithmetic
can't be both fast and clean, especially if exceptions are to be
handled in a reasonable way.

7) I had heard rumors that IBM would implement "fused" (one rounding)
multiply-add some day, and the inherent greater generality and
flexibility of a full quadruple precision implementation would be
advantageous eventually.

8) To be useful for supporting double-precision computation, it's not
necessary to implement the full quad instruction set the first time.
The essential instructions are fdmulq (doubled-precision product),
faddq, fsubq, fdtoq, fqtod.  In particular fmulq, fdivq, and fsqrtq are
harder and can wait.

SunOS 4.1.1 supports kernel emulation of the SPARC V8 quad
instructions, although you wouldn't want to use it - too slow.  C 1.1
includes a long double type, and Fortran 1.4 includes real*16 and
complex*32, which implement quad arithmetic in software without trapping
to the kernel.  If a surprise SPARC implementation with some quad
hardware came along (Sun doesn't know who all the SPARC implementers are),
it would be simple to rig up a .il inline expansion template file that
would allow these compilers to exploit that hardware to good effect,
although somewhat less effectively than if the compilers generated the
instructions inline.  Anyway, SPARC V8 is being taken over by SPARC
International and IEEE P1754.

Here's an example of using quad to support double.  The real part of a
complex product is computed as xr*yr-xi*yi.  If all computation is done
in double then rounding errors in the products can dominate the
significant figures of the result if cancellation occurs.   Furthermore
gratuitous overflow and underflow may affect intermediate results even
if the final result would lie in range.  Not to mention that SPECmarks
suffer if eight fpops are used instead of the minimal three.  Even so,
that's a price we have paid with Sun Fortran (for complex*8) in order
to reduce the rate of roundoff accumulation.

A brute force way to deal with this is to promote all the operands to
extended or quad.  That means providing extended or quad
multiplication, which is apt to be fairly slow.  Far more economical is
to compute dprod(xr,yr)-dprod(xi,yi), which requires two fdmulq's and an
fsubq, followed by an fqtod.  That is only four fpops, and it suffers one
tiny rounding error from the fsubq and a normal one from the fqtod.  So
the result is almost but not quite correctly rounded; to obtain correct
rounding would require only adding one more operation, fqsubd, to obtain
the sum of two quad operands rounded to double precision with only one
rounding error, and with only three fpops too, the same as the simplest
method, but with far better immunity against roundoff and exponent spill.
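
In C terms the dprod approach looks roughly like the sketch below, with
long double standing in for quad; this assumes long double carries at
least 106 significand bits (true of a 128-bit long double, not of the
80-bit x86 format), so each product of doubles is exact:

	double cmul_re_quad(double xr, double xi, double yr, double yi)
	{
	    long double pr = (long double)xr * yr;  /* fdmulq: exact product */
	    long double pi = (long double)xi * yi;  /* fdmulq: exact product */
	    return (double)(pr - pi);               /* fsubq, then fqtod */
	}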

IEEE 754 requires that operations round to no lower precision than the
more precise operand, and so fqsubd would seem to be a nonstandard
operation.  As I recall the reason for the 754 requirement was so that
exponent wrapping on trapped overflows and underflows would be
meaningful, but SPARC doesn't use exponent wrapping and I expect that
counting mode for overflows and underflows will never be worth its
cost.

In contrast conventional extended precision will still suffer roundoff
in computing xr*yr and xi*yi, which may be magnified by cancellation,
so the final rounded result can't be said to be nearly correctly
rounded.

Similarly fused multiply-add without extended or quad has to compute
	t = xr*yr 
	r = t - xi*yi 
and the benefit of one roundoff error
in the second step is lost in the noise of the normal roundoff error in
computing t.

Date: Tue, 4 Jun 91 12:27:22 -0700
From: validgh!uunet!mips.com!earl@Sun.COM (Earl Killian)
Subject: SPARC quad format

   Date: Tue, 4 Jun 91 11:20:33 PDT
   From: dgh@validgh.com (David G. Hough on validgh)

   6) At least one hardware designer thought that although it would take a
   little longer to get a quad precision adder implemented, it would not
   take so much longer that it was worthwhile doing a klugey type of quad
   precision such as composing one from two doubles.  Such arithmetic
   can't be both fast and clean, especially if exceptions are to be
   handled in a reasonable way.

I agree that if you decide to build a 112-113 bit datapath, it's not
much more difficult to design than a 53-bit datapath.  But that misses
the point; it does take more than twice the space, which is likely to
relegate quad precision to software for a very, very long time.  If
people really wanted quad, and wanted tolerable speed, they would
probably want a 106-bit precision, because that could be done in
hardware using two passes through an existing 53-bit datapath.

So it seems to me that you can have the extra bits and be forever so
slow that no one will ever use it, or you can cut back slightly and
have a chance that someday it will be fast.

Of course you might argue that 80-bit hardware has the datapath
required to do 112-113 bit quad with two passes.  True, but it's not
clear 80-bit has much future either.

Date: Tue, 4 Jun 91 13:44:39 PDT
From: validgh!dgh@Sun.COM (David G. Hough on validgh)
Subject: more on quad

Earl Killian points out that 113-significant-bit products are not easy
to build on 53-significant-bit multipliers.  I was thinking more of
building them in conjunction with the 64-significant-bit multipliers
associated with 64x64->128 bit doubled-precision integer multiplication.

And I was incorrect in my analysis of complex multiplication with fused
multiply-add.  Although the simplest approach

	t = xr*yr
	r = t - xi*yi

can have a large relative rounding error, for 2x the fpops you could do instead

	t1 = xr*yr
	t2 = xr*yr - t1
	t3 = t1 - xi*yi
	r  = t3 + t2

and it looks to me like the worst case bound on roundoff would be about three
times larger than the bound on roundoff in a correctly-rounded result.
Of course both are susceptible to intermediate exponent spill.
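
Written out with C99's fma(), which multiplies and adds with a single
rounding, the four-operation scheme reads as follows (a sketch, not
actual hardware code):

	#include <math.h>

	double cmul_re_fma(double xr, double xi, double yr, double yi)
	{
	    double t1 = xr * yr;           /* rounded product */
	    double t2 = fma(xr, yr, -t1);  /* exact rounding error of t1 */
	    double t3 = fma(-xi, yi, t1);  /* t1 - xi*yi with one rounding */
	    return t3 + t2;                /* fold the error back in */
	}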

One unsettling fact about fused multiply-add is that
a*b+c*d might be evaluated either as

	t = a*b
	r = t+c*d

or

	t = c*d
	r = a*b+t

and the rounded results may differ.  But for most programs that's probably no
more trouble than how extended precision has usually been implemented.

Date: Wed, 5 Jun 91 17:04:30 BST
From: Mark Homewood <validgh!uunet!meiko.co.uk!fred@Sun.COM>
Subject: Numeric Interest: Quad formats, General

> In-Reply-To: Earl Killian's message of Tue, 4 Jun 91 12:27:22 -0700

In general fp datapaths tend to be about 3 bits wider than the rounded fraction
width.

i.e. 53 + 3 => 56, and (112 + 3)/2 => 58.  Very little difference, I
imagine; when thinking about a two-pass scheme to implement a quad format,
2 or 3 extra bits won't break the bank.  More interesting would be an
argument about the size of the exponent, 22 or 16 bits, and whether that's
enough for one's calculations in a format to end all formats.

People I've spoken to would prefer a format, representable as two
doubles say, which can be executed as double-length procedure calls.  This
has the benefit of being vectorisable etc. on current and next generation
(32 bit) machines.  Hardware support to speed it up would naturally be a
possibility.
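
As an illustration of that two-double representation, here is Knuth's
two-sum in C, which captures the exact roundoff of an addition; a real
library would build its full add, multiply, and so on from primitives
like this:

	typedef struct { double hi, lo; } dd;  /* value is the unevaluated sum hi + lo */

	dd two_sum(double a, double b)
	{
	    dd r;
	    double s  = a + b;
	    double bb = s - a;                 /* portion of b absorbed into s */
	    r.hi = s;
	    r.lo = (a - (s - bb)) + (b - bb);  /* exact roundoff of a + b */
	    return r;
	}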

Supporting 3 formats and their cross product of conversions in a hardware
fpu implementation seems a bit like history repeating itself, and as an
implementor I'd rather not, thank you.
Better to fill the intermediate gap with some fast(ish) library calls which
use your longest format (double), and in the long term (5 yrs) forget single
length in the same way we've all forgotten about 16-bit objects (halfs,
shorts, stubbys or whatever they were called).  Better to make a clear
differentiation about how we want these things to develop, or we'll all be
going through the CISC/RISC revolution again.  Which is probably necessary
anyway to get rid of all the architectures that have mushroomed warts and
things, but not preferable.

Fred

Meiko Scientific Ltd, Bristol, England

Date: Wed, 5 Jun 91 22:12:49 PDT
From: Jon L White <validgh!uunet!lucid.com!jonl%kuwait@Sun.COM>
Subject: SPARC quad format

I'm a little surprised by one of the reasons you cited for progressing
ahead with the Sparc V8 128-bit float format.  You mention some experience
with Cray customers who refused to buy in to "Unix" (or, Sun products?)
because it didn't support the 128-bit format.

Now, quite coincidentally I've been carrying on a private conversation with
Tim Peters at Kendall Square Research, who apparently used to be a Fortran
compiler hacker for Cray [the basic topic was the use of "rounding modes" to
aid in detecting algorithmic instability].  One of his tales from the Cray
days pointed to something I've suspected for years, and am always looking
for hard evidence to support; here is Tim's "story":

    My favorites were the cases where someone ported from a 32-bit
    world and claimed to get "the wrong answer" on the 64-bit Cray:  on more
    than one occasion, after plotting "the answer" against a wide range of
    -trunc values, we found that "the answer" varied wildly around the
    -trunc values that corresponded to their 32-bit machine's significand
    size, and that "the answer" settled down quickly as the precision
    increased.  There was no more effective means of suggesting that what
    they had been accepting as "the answer" was cheese than to work up this
    kind of graph.

In response to this, I recounted the bit of search/research that led Lucid
to adopt a uniform, double format as the normal float representation within
Lisp, despite the permission of the language standard to allow up to four 
distinct user-visible formats.  And if I read Mark Homewood's message right,
he too would prefer to drop the multiplicity of formats that made much more
sense back when INT and LONG really needed to be different on a PDP-11.

My "story" doesn't actually rule out all need for quad formats, but it
has some substantiation that it will be of lesser, or rarer, utility.
(And perhaps just a handful of Cray adherents demanding quad precision
would be enough to provoke some serious concern.)  So here is the Lucid
story in full:


Way back in late 1986, when Lucid was planning its 3.0 release, we faced
the question of whether to expand out to four distinct float formats,
as permitted by CLtL ("Common Lisp: The Language").  Two obvious things
militated against doing this: it would be a moderate amount more work
for the part of our optimizing compiler that was doing data representation
changes for optimization (i.e. converting from "pointers" to the actual
machine representation for floats, and back again by "consing" when
necessary -- all in order to do Fortran-level numeric optimization and
to side-step any intermediate consing); and it would generate further
incompatibilities not only with our 2.1 release, which supported only
one 32-bit format, but likely with other vendors as well.  Not surprisingly,
a couple of other Lisp vendors widened out their representational schemes
to include distinct formats for SINGLE, DOUBLE and LONG, but failed to
understand the portability failure under such schemes [even Guy Steele
misunderstood this point until I reminded him that Lisp's introspective
typing capability meant that one could notice the difference just by a
form like  (typep 1.2s0 'double-float).  It wasn't just Guy; it was the
whole of the X3J13 committee.]

Well, after all, the multi-choice scheme for CLtL was chosen in the very 
early 1980's when, for example, double-floating operations on typical 
hardware were from 3 to 10 times slower than single;  it seemed to make 
sense to permit Lisp to be cluttered up with more choices of representational
matters than you could shake a stick at.  But from looking at the MC68881 
manual, I suspected that by the late 1980's, this speed advantage would 
shrink or disappear; so we began quietly asking our contacts in customer 
land whether they *really* used floating point arithmetic, and how much 
speed mattered.  The killer response came from a guy at Alliant: he was a 
Fortran user/implementor and hardware hacker who was only vaguely familiar 
with Lisp, but was willing to talk to us.  He pointed out that:

  (1) Alliant was planning to switch from a 32-bit memory bus to a 64-bit
      one, so the memory cost of using doubles would only be space, not time;
  (2) A large number of numerical algorithms in common use were unstable
      enough to produce "cheese" under the IEEE 32-bit format, but *none*
      were unstable enough to be proven losers in the 64-bit format; i.e.,
      there were algorithmic reasons for requiring at least 64, but none
      (or, none known) for requiring higher precision.  [he was speaking
      for the Fortran user community of which he was aware]
  (3) Alliant was currently asking for bids on hardware sub-components;
      their demands were:
      (3a) the fastest possible IEEE-compliant 64-bit format chips
      (3b) 32-bit format operations would not be required, but would
           be tolerated providing they were not more than about 50%
           slower than the double format.

Yes, that is no typo.  SINGLE is ok as long as (for backwards compatibility)
running it isn't significantly slower than the "default" DOUBLE.  That in
fact seemed to be the "state of the FP art" even back then.  So that ended
my survey right there.  We made our decision to opt for only one "first
class" FLOAT type, and you can guess what format it was.  But of course we
supported packed floats too -- SINGLE format in arrays, just as there are
packed integer arrays -- this is also required for compatibility at the
Foreign Function Interface level, where both C and Fortran programs may
require the use of packed float arrays.



I hope I haven't misremembered or misrepresented the Alliant concerns.

Date: Thu, 6 Jun 91 06:04:42 PDT
From: validgh!dgh@Sun.COM (David G. Hough on validgh)
Subject: floating-point data types


Most physical data is only good to single precision, so 32 bits is ample
to store it; more is wasteful.  So a 32-bit storage format will always be
useful.  

Some 32-bit operations will always be faster than 64-bit; division and
sqrt in particular in hardware, and elementary transcendental functions
in software.  So there will be performance reasons for continuing to want
32 bits.

Since the data is good to 32 bits, most algorithms could use 32-bit arithmetic
most of the time, with occasional extensions to higher precision at critical
places.  However recognizing the critical places in complicated programs is
not so easy, so it may be cheaper simply to pay for the hardware than to
pay for the mental analysis.   Using 64-bit arithmetic on complicated programs
makes many of them work acceptably well despite roundoff without any further
analysis.  So that's what's often done.

Even so there are sufficient applications for 32-bit arithmetic, such as
graphics and signal processing, that I think it will be advantageous to
provide it for a while.  64- and 128-bit buses can be used to move vectors
of 32-bit numbers
or 32-bit complex numbers.

On the other end, it seems to happen often enough that people have strung
together several unstable algorithms so that 64-bit arithmetic is insufficient.
Furthermore when data is known to better than 32-bit precision, it may be
helpful to have 128-bit arithmetic for the critical parts of those algorithms
that are mostly in 64-bit arithmetic.  And interval arithmetic can burn up
precision very quickly too.  Sun France once nearly lost a deal over 128-bit
arithmetic, contemplated building a coprocessor and extending a compiler
to handle it, but in the end got the deal by having former employee 
Andre Lieutier analyze the problem and suggest a cheaper solution.  Turns
out the customer was doing linear least squares in the worst possible way.
Nothing new about that - twenty years ago when I was a "math software 
consultant" at the UCB computer center I quickly learned that whenever anybody
asked for a double-precision matrix inversion routine (for the CDC 6400)
they were really doing linear least squares in the worst possible way.

Anybody considering implementing 128-bit extended or quad should consider
whether it would be better to provide variable-precision floating-point
instead (implemented in software with hardware support like 64x64->128
integer unsigned multiply, 64+-64 unsigned integer addition with carry
in and out, 128/64->64,64 unsigned integer quotient and remainder, etc.)
but I think that 90% of the need can be handled with 128-bit quad of
106 or more significant bits so that products of doubles can be held
exactly... and the necessary instructions are a lot easier to 
understand and justify in a RISC paradigm without hypothesizing about
language extensions necessary to adequately support variable-precision
floating point.
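
As a sketch of the first of those primitives, using a compiler-supplied
128-bit integer type (an assumption: GCC-style unsigned __int128 rather
than a native instruction pair):

	#include <stdint.h>

	/* 64x64 -> 128 unsigned multiply: the doubled-precision product
	   primitive mentioned above. */
	void mul_64x64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
	{
	    unsigned __int128 p = (unsigned __int128)a * b;
	    *lo = (uint64_t)p;
	    *hi = (uint64_t)(p >> 64);
	}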

Date: Thu, 6 Jun 91 09:29:02 PDT
From: Jon L White <validgh!uunet!lucid.com!jonl%kuwait@Sun.COM>
Subject: floating-point data types

re: Most physical data is only good to single precision, so 32 bits is ample
    to store it; more is wasteful. . . . 
    Since the data is good to 32 bits, most algorithms could use 32-bit 
    arithmetic most of the time, with occasional extensions to higher 
    precision at critical places.  However recognizing the critical places 
    in complicated programs is not so easy, so it may be cheaper to simply 
    to pay for the hardware than to pay for the mental analysis.   

Or, "cheaper than hiring a numerical analyst to figure out the RIGHT
answer"!  Trying to squeeze each-and-every program variable down to
it's minimal (significand) size may be penny-wise and pound-foolish.

Our decision for Lisp to provide packed arrays for 32-bit storage, but 
not to bother with finer-than-double variations for individual
variables, focuses on the one place where storage size really counts --
in huge arrays.  Indeed one would expect that time-critical operations 
over such arrays -- like for graphics and signal processing as you
mentioned -- would either be done by independent hardware co-processors,
or (in Lisp's case) by "foreign" function libraries especially optimized
by good analysts/programmers.  


re: I quickly learned that whenever anybody asked for a double-precision 
    matrix inversion routine (for the CDC 6400) they were really doing 
    linear least squares in the worst possible way.

Verrry interesting.  What may be lacking is some wider availability
of knowledge about how to assess software packages.  Not being in that
world myself, I wonder if there are generally accepted standards for
the quality of LINPACK-like routines?

Date: Thu, 6 Jun 91 13:47:37 -0700
From: validgh!uunet!wk81.nas.nasa.gov!rcarter@Sun.COM (Russell L. Carter)
Subject: Comments on quad precision

Since the numeric-interest mailing list seems to be in story
mode, I thought I might add a story of mine to the list, which
is relevant to the issue of quad (128 bit) precision.
My work at NAS has focused on the performance evaluation of 
parallel supercomputers, and inadvertently has rubbed up against
a specialty of this list, the (in)accuracy of floating point arithmetic.

We operate two Crays at NAS: a Cray 2 and a Y-MP.  It seems that
one of our users achieved significantly different numerical results
for the same calculation performed on the two machines.  The
calculation is a parallelized variant of Cholesky decomposition,
followed by back substitution, of a symmetric positive definite
matrix arising from a structures problem.  The matrix is of order
16146.  The interesting calculated quantities were the max norm
of the solution vector and the 2-norm of the residual vector.
The user had access to a wide variety of floating point boxes: IBM 3090, VAX,
IEEE, Cray 2, and Y-MP.  The problem was unaffected by parallelization.

He found that the max of the solution vector decreased with decreasing
mantissa length (56>=56>53>48>=48), while the residual increased.
So far so good: a check of the condition number, which turned out
to be > 10^11, is consistent with this behavior.  Unfortunately,
there was a significant difference between the Crays, with the Y-MP
having two *decimal* digits worse accuracy than the Cray 2.  Since
the Crays nominally have the same floating point architecture (but
not exactly!)  and in particular the same mantissa length, and
the Y-MP had a month previously been certified by my supervisor
as having passed the Government's 30 day acceptance test during which 
the hardware was thought to have worked correctly, there was considerable
consternation and activity concerning this result.

A bit of bureaucratic maneuvering simplified the abstract contractual
problems, but to guard against future surprises I was encouraged to
investigate the anomaly further.  Having had a course in floating
point arithmetic based on Sterbenz, I had a little idea about what
to look for, but fortunately, I didn't have to think very long,
because who else but Kahan began to bang on the door (well, actually
by phone) demanding to be told about the problem.  I did so, and
spent the rest of the day running experiments that discredited
my initial theories about where the problem was located, to Kahan's
delight.  Whereupon Kahan declared that he knew what the problem
was, that I should stay away from it (!), and that it had to do
with the Y-MP's subtraction operation, and the positive
definiteness of the matrix (a clue!), which makes the diagonal entries
of the Cholesky factors a little larger.

To make a long story short, consultation of Cray manuals disclosed
that while neither Cray has a guard digit for the subtract
magnitude operation, the Cray 2 has a rounded preshift, while
the Y-MP does not.  The effect is that the Cray 2 produces a subtract
magnitude result whose error is distributed uniformly about the correct
result, but the Y-MP always produces a result that, if it is in error,
is larger than the exact floating point result.

Examination of the algorithm and code showed that the diagonal
entries of the Cholesky factors are computed from subtract
magnitude operations.  I performed an analysis, for a test matrix
obtained from the discretized Laplacian on a square domain, that
provides bounds for the difference in the interesting quantities
produced by performing the calculation in the two arithmetics; the
bounds are consistent with the data and with Kahan's suggestion.
Additionally, I had a summer intern solve a large number of test
cases using the LINPACK Cholesky routine, which produced data
consistent with the model.  So this slight difference in the
subtraction operation of the machine can conceivably render
unreliable certain calculations of interest to NASA.

Now, a simple approach to improve the reliability of this computation
would be to carry out the calculation in Cray double (128 bit) precision.
Unfortunately, this approach is not viable for us for two reasons.  The
first is that doubling the required amount of storage is bad because
we are already trying to solve problems that tax the memory limits
of all available computer systems.  The example problem I mention
requires about 80 Mbytes of core memory, and we would like to
solve problems 10-100 times larger.  Add to this the fact that
parallelizing calculations nearly always significantly, if not
drastically, increases memory requirements, and it's obvious that
128 bits is not the way to go.

The second problem is the obvious one:  128 bit calculations are
slower on all machines, and waiting days for an accurate answer is
not good.

Now, as matrices scale larger, condition numbers scale larger as well
and we became worried that we might in the future buy a system
that could carry out the required floating point operations to solve 
larger problems in a reasonable amount of time, but with little or 
no accuracy.

About this time I began familiarizing myself with the LAPACK project,
and in particular, the new error analyses that the new algorithms
took advantage of.  One of the results of this work is that iterative
refinement is often very valuable when applied to factors obtained
from less stable methods than classic Gaussian elimination with
partial pivoting.  The nice result is that double precision is not
necessary for calculation of the residual, so it can be very fast
on a machine like a Cray, which has abominable 128-bit floating
point performance.  I applied this "single precision iterative refinement"
to the Y-MP Cholesky factors and, voila, achieved accuracy equivalent
to the Cray 2 with one iteration (a trivial amount of cpu time).
So in summary, this is another case where application of numerical
mathematics has solved a problem caused by the limitations of computer
hardware.
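
In outline, one step of that refinement looks like the C sketch below;
cholesky_solve() is a hypothetical stand-in for a solve that reuses the
existing factors, and the residual is computed in the same working
precision as everything else:

	#include <stdlib.h>

	/* One refinement step: r = b - A*x in working precision, solve
	   A*d = r using the existing factors, then x += d.  A is n x n,
	   stored row-major. */
	void refine_once(int n, const double *A, const double *b, double *x,
	                 void (*cholesky_solve)(int n, double *rhs))
	{
	    double *r = malloc(n * sizeof *r);
	    for (int i = 0; i < n; i++) {
	        double s = b[i];
	        for (int j = 0; j < n; j++)
	            s -= A[i*n + j] * x[j];
	        r[i] = s;
	    }
	    cholesky_solve(n, r);        /* the correction d overwrites r */
	    for (int i = 0; i < n; i++)
	        x[i] += r[i];
	    free(r);
	}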

Over the last year or so I have looked for problems of interest to
our user community (linear systems arising from discretizing pdes)
that exhibit sensitivity to word size like the
problem I described above, and I haven't found any others.  If anyone
on this list has interesting counterexamples, I'd love to see them.  

Thus I'm led to believe that 64 bit (even dirty!) floating point is adequate
for NAS CFD work for at least the next 10 years or so, and we won't 
be putting specifications for 128 bit hardware floating point into 
our procurements anytime soon.  I do think that having quad precision 
can be very useful for diagnostic purposes, in particular I used it 
extensively while analyzing the above problem, and I hope it is 
available on future systems.

Russell Carter
Numerical Aerodynamic Simulation
NASA Ames Research Center
rcarter@nas.nasa.gov


Subject: -trunc, float sizes, heat death (was Re: SPARC quad format)
Date: Fri, 07 Jun 91 01:39:10 -0400
From: Tim Peters <validgh!uunet!ksr.com!tim@Sun.COM>

Jon quoted a piece of my E-Mail recently, & it occurs to me that it
won't make sense to people who don't know what the "-trunc" referred to.

That's a compiler option Cray has always supported in Fortran:  if you
say -trunc=n on the command line, where n is an integer in [1,47], the
compiler generates code after every computational fp instruction to
unconditionally zero-out the last n bits of the result.  Thus rather
than using Cray's native 48-bit arithmetic, it simulates the effect of a
different kind of poor arithmetic for your choice of significand sizes
(up to the native size).
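
The effect is the following operation applied after every fp op, sketched
here in C for an IEEE double rather than Cray's format (an approximation;
Cray's 48-bit significand and word layout differ):

	#include <stdint.h>
	#include <string.h>

	/* Zero the last n bits of x's significand, simulating -trunc=n. */
	double trunc_bits(double x, int n)
	{
	    uint64_t u;
	    memcpy(&u, &x, sizeof u);        /* reinterpret the bits */
	    u &= ~((UINT64_C(1) << n) - 1);  /* clear the low n bits */
	    memcpy(&x, &u, sizeof x);
	    return x;
	}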

This proved to be an extremely effective tool in confirming, isolating
and repairing suspected numeric instabilities.  If there's interest,
I'll write more about that when I have some time.  Alas, while it was
remarkably cheap to do this via software on the Cray architecture, the
trend toward split register sets makes it a much less attractive
proposition for general production use.  It's dirt cheap to do in
hardware, though, "if only" there were sufficient market interest to get
someone to bother (that does seem to be a continuing theme on this
mailing list <grin/sigh -- where's Herman Rubin when you need 'im?!>) ...

The piece Jon quoted had to do with the particular use of this option to
take "the blame" off Cray when initial program ports didn't "work".  If
you reread the quote, knowing what the "-trunc" part was about, it
should make more sense now.

But the story doesn't end there, & how it *does* end is in a different
direction than Jon took it.  The deal is that people who knew what they
were doing generally didn't buy Crays to speed up their programs (Killer
Micro aficionados take note <grin>), but to make it possible to run
*larger* problems.  In other words, it's generally not the case that
they have a fixed workload they want to run faster, but generally is the
case that they have a fixed constraint on the wall-clock turnaround time
for a single problem instance, and want to run a larger problem
instance in the same amount of time (caution:  a time quantum here may
be a minute, but for some customers it was measured in months).

A "problem instance" typically took O(n^3) time, so a problem instance
"twice as large" (e.g., a 3D simulation with the dimensions doubled or
the mesh size halved) takes about 8 times the computation to complete; &
a problem instance 3 times larger takes ~27x times the computation etc.

So if 32 bits was already inadequate to run the problem sizes they were
running, it was *way* inadequate to run the problem sizes they *wanted*
to run.  At the time I left Cray (about 3 years ago), the problems were
getting large enough and the machines were getting fast enough and the
compilers were getting smart enough that a few more-or-less vanilla
numerical programs were indeed pushing the limits of what could be done
with Cray's 64-bit format.  All those "extra" bits get chewed up real
fast in this framework, and of course Cray's famous arithmetic only
makes it worse.

754's double format has five more significand bits than Cray's 64-bit
format, and 754's arithmetic is much better behaved, so a switch to 754
should buy several more years for the "grand challenge" kinds of monster
number-crunching programs.  But as we build toward routine teraflop
machines and beyond, the problem instances will grow to match, and 64
bits will no longer cut it on the high end.

In the meantime, some form of quad or doubled precision (take your pick:
users don't care <0.4 grin>) support *is* occasionally necessary, and
I'd guess you could pick up a small part of Cray's market by doing it
faster than they do (Cray has no hardware support at all; but then few
people value *blazing* 128-bit support enough to pay for it).  Beyond
that, of course people can put in all the goofball formats they want,
but I don't see the high end of the market willing to *pay* for more
than two.  We can chicken-or-egg it as to why Fortran has, up until F90
Bloat, supported only REAL and DOUBLE, but the possibility that it's
because it's a darned good compromise all around shouldn't be dismissed
out of hand.

Date: Fri, 7 Jun 91 10:45:29 -0700
From: validgh!uunet!wk81.nas.nasa.gov!rcarter@Sun.COM (Russell L. Carter)
Subject: more on float sizes

Tim Peters writes, while explaining -trunc:

<This proved to be an extremely effective tool in confirming, isolating
<and repairing suspected numeric instabilities.  If there's interest,
<I'll write more about that when I have some time.  Alas, while it was
<remarkably cheap to do this via software on the Cray architecture, the
<trend toward split register sets makes it a much less attractive
<proposition for general production use.  It's dirt cheap to do in
<hardware, though, "if only" there were sufficient market interest to get
<someone to bother (that does seem to be a continuing theme on this
<mailing list <grin/sigh -- where's Herman Rubin when you need 'im?!a>) ...

I have to agree, enthusiastically, about the value of -trunc.  We
discard any benchmark that displays numeric instability when truncating
5 bits.  There are quite a few, and many times the code authors
are unaware how sensitive the numbers they are producing really are
to the word size of the machine.  Count NAS in for "sufficient interest".

Tim also writes:

<So if 32 bits was already inadequate to run the problem sizes they were
<running, it was *way* inadequate to run the problem sizes they *wanted*
<to run.  At the time I left Cray (about 3 years ago), the problems were
<getting large enough and the machines were getting fast enough and the
<compilers were getting smart enough that a few more-or-less vanilla
<numerical programs were indeed pushing the limits of what could be done
<with Cray's 64-bit format.  All those "extra" bits get chewed up real
<fast in this framework, and of course Cray's famous arithmetic only
<makes it worse.

This is certainly a sensible and intuitive position to take, but
the evidence I've seen doesn't support this.  As one of the people
responsible for buying equipment that is supposed to be capable of
tackling "Grand Challenge" problems, it may not be surprising that
I'm really interested in what the word size requirements will be
in order that these problems get solved accurately.

I mentioned above that we see a lot of codes that output numbers
very sensitive to the word length the code was run on.  However,
in nearly every case, the numbers are the result of a sensitive
calculation, like computing residuals, for example, that are not
the primary values of interest, which might be pressures, velocities,
or displacements.  Almost without exception, whenever we have found 
an instability in the interesting quantities, it has turned out
that the model was bad, the grid was poor, or the numerical method was
inappropriate.  I'll repeat a request from my previous email
to this list: if any of the recipients of this list knows of an
exception, preferably one where I can examine the code, I'd *love*
to see it.

This is not to say that other fields, with different numerical
methods, aren't already in need of 128 bit floating point.  I
just don't think that the aerophysics part of the "Grand Challenge"
problems will require it.  Aerophysics problems solved at NAS
generally involve the solution of extremely large, very narrow
bandwidth sparse linear systems.  As the error analyses that
accompany the LAPACK algorithms show, these types of systems
can in general be solved very accurately, as long as the condition
numbers are moderate.  I don't have any evidence that the
condition numbers are anything other than moderate.

So in this age of buying computer systems by the pound, I much
prefer to keep my memory requirements modest (well if you can call
10 Gbytes or so modest) by keeping the hardware floating point 
computations at 64 bits, and improving my numerical techniques
if more accuracy is required.

Russell Carter
Numerical Aerodynamic Simulation, ARC
rcarter@nas.nasa.gov

Date: Thu, 20 Jun 91 06:07:49 PDT
From: validgh!dgh@Sun.COM (David G. Hough on validgh)
Subject: pivoting, Linpack performance, tunable SPEC results


I'm getting myopic, I guess; on my assertion that pivoting doesn't hurt
performance much, Wendy Thrash writes:

> I'd have to dispute this for compilers that do blocking, since it seems
> to be exactly the pivoting that prevents blocking of certain routines.

> Porterfield, for example, in his dissertation, classifies LINPACKD
> as "non-transformable" because of its pivoting.  To put this in perspective,
> Wolf and Lam, in their upcoming SIGPLAN PLDI paper, report about 20-25%
> speedup via blocking of a 500 x 500 LU factorization _without pivoting_
> on a single processor of an SGI 4D/380; this blocking also more than doubles
> the four-processor speed and quadruples the eight-processor speed over the
> unblocked versions of the algorithm for four and eight processors.

I guess the moral is that on the 1000x1000 Linpack benchmark (where rewriting
the source code is expected) it would be OK
to rewrite it in quad and throw away the pivoting.  Or maybe just throw
away the pivoting.

Date: Thu, 20 Jun 91 11:27:12 -0700
From: validgh!uunet!mips.com!earl@Sun.COM (Earl Killian)
Subject: pivoting, Linpack performance, tunable SPEC results

   Date: Thu, 20 Jun 91 06:07:49 PDT
   From: dgh@validgh.com (David G. Hough on validgh)

   I guess the moral is that on the 1000x1000 Linpack benchmark (where
   rewriting the source code is expected) it would be OK to rewrite it
   in quad and throw away the pivoting.  Or maybe just throw away the
   pivoting.

I once submitted a 64-bit non-pivoted result for 1000x1000 to Jack
Dongarra, the benchmark's author, and he said come back when you can
do pivoting.  So I don't think your last sentence will fly.

On the other hand, LAPACK shows that pivoting need not get in the way
of blocking.  Of course, I wouldn't expect a compiler to turn LINPACK
into LAPACK automatically.



