Comments on quad precision

Russell L. Carter uunet!wk81.nas.nasa.gov!rcarter
Thu Jun 6 13:47:37 PDT 1991


Since the numeric-interest mailing list seems to be in story
mode, I thought I might add a story of my own, one that is
relevant to the issue of quad (128 bit) precision.
My work at NAS has focused on the performance evaluation of 
parallel supercomputers, and inadvertently has rubbed up against
a specialty of this list, the (in)accuracy of floating point arithmetic.

We operate two Crays at NAS: a Cray 2 and a Y-MP.  It seems that
one of our users obtained significantly different numerical results
for the same calculation performed on the two machines.  The
calculation is a parallelized variant of Cholesky decomposition,
followed by back substitution, of a symmetric positive definite
matrix arising from a structures problem.  The matrix is of order
16146.  The interesting computed quantities were the max norm of
the solution vector and the 2-norm of the residual vector.
The user had access to a wide variety of floating point boxes:
IBM 3090, VAX, IEEE, Cray 2, and Y-MP.  The discrepancy was
unaffected by parallelization.
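
For concreteness, here is a minimal sketch of the quantities being
compared, written with numpy as a stand-in for the actual
parallelized Cray solver (the function and variable names are mine,
not the user's code):

    import numpy as np

    def solve_and_check(A, b):
        # Cholesky decomposition A = L L^T of the SPD matrix,
        # followed by forward and back substitution for x.
        L = np.linalg.cholesky(A)
        y = np.linalg.solve(L, b)        # forward substitution (stand-in)
        x = np.linalg.solve(L.T, y)      # back substitution (stand-in)
        # The two quantities compared across machines:
        # max norm of the solution and 2-norm of the residual.
        r = b - A @ x
        return np.max(np.abs(x)), np.linalg.norm(r)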

He found that the max norm of the solution vector decreased with
decreasing mantissa length (56 >= 56 > 53 > 48 >= 48, in the order
the machines are listed above), while the residual increased.
So far so good: a check of the condition number, which turned out
to be > 10^11, showed this behavior to be consistent with the
conditioning.  Unfortunately, there was a significant difference
between the Crays, with the Y-MP giving two *decimal* digits worse
accuracy than the Cray 2.  Since the Crays nominally have the same
floating point architecture (but not exactly!), and in particular
the same mantissa length, and the Y-MP had a month previously been
certified by my supervisor as having passed the Government's 30 day
acceptance test, during which the hardware was thought to have
worked correctly, there was considerable consternation and activity
concerning this result.
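
A back-of-the-envelope check of that consistency claim: the usual
rule of thumb bounds the relative error of the computed solution by
roughly kappa(A) times one unit in the last place of the mantissa.
A quick sketch (my numbers, using the nominal mantissa lengths
listed above):

    kappa = 1e11                      # observed condition number, > 10^11
    for machine, t in [("IBM 3090 / VAX", 56),
                       ("IEEE double",    53),
                       ("Cray",           48)]:
        print("%-15s t=%2d  kappa * 2^-t ~ %.1e" % (machine, t, kappa * 2.0 ** -t))
    # Cray (t=48) gives ~4e-4, i.e. only 3-4 trustworthy decimal digits,
    # so accuracy drifting with mantissa length is expected.  An *extra*
    # two-decimal-digit gap between two 48-bit machines is not explained
    # by this bound, which is what made the Y-MP result alarming.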

A bit of bureaucratic maneuvering simplified the abstract contractual
problems, but to guard against future surprises I was encouraged to
investigate the anomaly further.  Having had a course in floating
point arithmetic based on Sterbenz, I had a little idea about what
to look for, but fortunately I didn't have to think very long,
because who else but Kahan began to bang on the door (well, actually
by phone) demanding to be told about the problem.  I did so, and
spent the rest of the day running experiments that discredited
my initial theories about where the problem was located, to Kahan's
delight.  Whereupon Kahan declared that he knew what the problem
was, that I should stay away from it (!), and that it had to do
with the Y-MP's subtraction operation and the positive
definiteness of the matrix (a clue!), which makes the diagonal
entries of the Cholesky factors a little larger.

To make a long story short, consultation of the Cray manuals
disclosed that while neither Cray has a guard digit for the
subtract magnitude operation, the Cray 2 has a rounded preshift,
while the Y-MP does not.  The effect is that the Cray 2 produces a
subtract magnitude result whose errors are scattered about the
correct result, but the Y-MP, when it is in error, always produces
a result that is larger than the exact floating point result.
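
To see what that difference means on a single operation, here is a
toy model of a guard-digit-less subtract magnitude (my own
simplification, not the actual Cray datapath):

    from fractions import Fraction

    P = 6   # toy significand width; the real Crays carried 48 bits

    def sub_magnitude(asig, aexp, bsig, bexp, rounded_preshift):
        # Compute a - b for a >= b > 0, where a = asig * 2^aexp and
        # b = bsig * 2^bexp with P-bit integer significands.  With no
        # guard digit, the smaller operand is preshifted right and only
        # P bits are kept: rounded (Cray 2 style) or truncated (Y-MP style).
        d = aexp - bexp
        if d == 0:
            bkept = bsig
        elif rounded_preshift:
            bkept = (bsig + (1 << (d - 1))) >> d    # round the shifted-off bits
        else:
            bkept = bsig >> d                       # simply drop them
        return (asig - bkept) * Fraction(2) ** aexp

    # a = 1.00000 (binary), b = 0.111111 (binary); exact a - b = 1/64.
    print(sub_magnitude(0b100000, -5, 0b111111, -6, rounded_preshift=True))   # 0    (below exact)
    print(sub_magnitude(0b100000, -5, 0b111111, -6, rounded_preshift=False))  # 1/32 (above exact)
    # Truncating the subtrahend can only make it smaller, so the Y-MP-style
    # result is never below the exact difference; the rounded preshift errs
    # in both directions about it.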

Examination of the algorithm and code showed that the diagonal
entries of the Cholesky factors are computed from subtract
magnitude operations.  I performed an analysis for a test matrix
obtained from the discretized Laplacian on a square domain; it
provides bounds on the difference in the interesting quantities
produced by performing the calculation in the two arithmetics, and
those bounds are consistent with the data and with Kahan's
suggestion.  Additionally, I had a summer intern solve a large
number of test cases using the LINPACK Cholesky routine, which
produced data consistent with the model.  So this slight difference
in the subtraction operation of the machines can conceivably render
unreliable certain calculations of interest to NASA.
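
For readers without the code at hand, the place where the subtract
magnitude enters is visible in a textbook unblocked Cholesky (again
a sketch of mine, not the NAS solver):

    import numpy as np

    def cholesky_lower(A):
        # Each diagonal entry is sqrt(a_jj - sum of squares); positive
        # definiteness keeps that difference positive but it can be small,
        # so a subtraction that errs high makes the diagonal entries (and
        # hence the factors) slightly too large, as on the Y-MP.
        n = A.shape[0]
        L = np.zeros_like(A, dtype=float)
        for j in range(n):
            s = A[j, j] - np.dot(L[j, :j], L[j, :j])   # subtract magnitude step
            L[j, j] = np.sqrt(s)
            for i in range(j + 1, n):
                L[i, j] = (A[i, j] - np.dot(L[i, :j], L[j, :j])) / L[j, j]
        return L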

Now, a simple approach to improve the reliability of this computation
would be to carry out the calculation in Cray double (128 bit) precision.
Unfortunately, this approach is not viable for us for two reasons.  The
first is that doubling the required amount of storage is bad because
we are already trying to solve problems that tax the memory limits
of all available computer systems.  The example problem I mention
requires about 80 Mbytes of core memory, and we would like to
solve problems 10-100 times larger.  Add to this the fact that
parallelizing calculations nearly always significantly, if not
drastically, increases memory requirements, and it's obvious that
128 bits is not the way to go.

The second problem is the obvious one:  128 bit calculations are
slower on all machines, and waiting days for an accurate answer is
not good.

Now, as matrices scale larger, condition numbers scale larger as
well, and we became worried that we might in the future buy a system
that could carry out the required floating point operations to solve 
larger problems in a reasonable amount of time, but with little or 
no accuracy.

About this time I began familiarizing myself with the LAPACK project,
and in particular, the new error analyses that the new algorithms
took advantage of.  One of the results of this work is that iterative
refinement is often very valuable when applied to factors obtained
from less stable methods than classic Gaussian elimination with
partial pivoting.  The nice result is that double precision is not
necessary for calculation of the residual, so it can be very fast
on a machine like a Cray, which has abominable 128 bit floating
point performance.  I applied this "single precision iterative refinement"
to the Y-MP Cholesky factors, and voila, achieved accuracy equivalent
to the Cray 2 with one iteration (a trivial amount of cpu time).
So in summary, this is another case where application of numerical 
mathematics has solved a problem caused by the limitations of computer
hardware.
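
For anyone who wants to try this, the refinement loop is only a few
lines; a sketch under the assumption of a dense SPD system, with
scipy's cho_factor/cho_solve standing in for the parallelized
factorization:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def refined_solve(A, b, iterations=1):
        # Factor once and solve, all in the working (64 bit) precision.
        c, low = cho_factor(A)
        x = cho_solve((c, low), b)
        for _ in range(iterations):
            r = b - A @ x                 # residual, also in working precision
            d = cho_solve((c, low), r)    # correction reuses the existing factors
            x = x + d
        return x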

Over the last year or so I have looked for problems of interest to
our user community (linear systems arising from discretized PDEs)
that exhibit sensitivity to word size like the
problem I described above, and I haven't found any others.  If anyone
on this list has interesting counterexamples, I'd love to see them.  

Thus I'm led to believe that 64 bit (even dirty!) floating point is adequate
for NAS CFD work for at least the next 10 years or so, and we won't 
be putting specifications for 128 bit hardware floating point into 
our procurements anytime soon.  I do think that having quad precision 
can be very useful for diagnostic purposes; in particular, I used it 
extensively while analyzing the above problem, and I hope it is 
available on future systems.

Russell Carter
Numerical Aerodynamic Simulation
NASA Ames Research Center
rcarter@nas.nasa.gov



