floating-point data types

David G. Hough on validgh dgh
Thu Jun 6 06:04:42 PDT 1991


Most physical data is only good to single precision, so 32 bits is ample
to store it; more is wasteful.  So a 32-bit storage format will always be
useful.  

Some 32-bit operations will always be faster than 64-bit; division and
sqrt in particular in hardware, and elementary transcendental functions
in software.  So there will be performance reasons for continuing to want
32 bits.

Since the data is good to 32 bits, most algorithms could use 32-bit arithmetic
most of the time, with occasional extensions to higher precision at critical
places.  However recognizing the critical places in complicated programs is
not so easy, so it may be cheaper to simply to pay for the hardware than to
pay for the mental analysis.   Using 64-bit arithmetic on complicated programs
makes many of them work acceptably well despite roundoff without any further
analysis.  So that's what's often done.

Even so, there are enough applications for 32-bit arithmetic, such as graphics
and signal processing, that I think it will be advantageous to provide it for
a while.  64- and 128-bit buses can be used to move vectors of 32-bit numbers
or 32-bit complex numbers.

On the other end, it seems to happen often enough that people string
together several unstable algorithms until 64-bit arithmetic is insufficient.
Furthermore when data is known to better than 32-bit precision, it may be
helpful to have 128-bit arithmetic for the critical parts of those algorithms
that are mostly in 64-bit arithmetic.  And interval arithmetic can burn up
precision very quickly too.  Sun France once nearly lost a deal over 128-bit
arithmetic, contemplated building a coprocessor and extending a compiler
to handle it, but in the end got the deal by having former employee 
Andre Lieutier analyze the problem and suggest a cheaper solution.  Turns
out the customer was doing linear least squares in the worst possible way.
Nothing new about that - twenty years ago when I was a "math software 
consultant" at the UCB computer center I quickly learned that whenever anybody
asked for a double-precision matrix inversion routine (for the CDC 6400)
they were really doing linear least squares in the worst possible way.

Anybody considering implementing 128-bit extended or quad should consider
whether it would be better to provide variable-precision floating-point
instead (implemented in software with hardware support like 64x64->128
integer unsigned multiply, 64+-64 unsigned integer addition with carry
in and out, 128/64->64,64 unsigned integer quotient and remainder, etc.).
But I think that 90% of the need can be handled with a 128-bit quad of
106 or more significant bits, so that products of doubles can be held
exactly... and the necessary instructions are a lot easier to 
understand and justify in a RISC paradigm without hypothesizing about
language extensions necessary to adequately support variable-precision
floating point.


