recent postings to comp.arch

David G. Hough on validgh dgh
Tue Jun 18 09:41:06 PDT 1991


Some of these may be of general interest.
If you are on numeric-interest and don't read comp.arch and appreciate
seeing these repostings, let me know; if few people so respond then I
won't do it any more, since many people on numeric-interest do read comp.arch.

===============================================================================
 From postnews Sun Jun  9 21:36:58 1991
Subject: Benchmarking performance vs. robustness
Newsgroups: comp.arch

There's been an ongoing discussion of issues of performance vs. robustness in
comp.arch.  How much is it worth to improve the performance on the 
"common" case at the cost of producing wrong results on "uncommon"
cases?  "uncommon" is not necessarily rare, of course.

IEEE 754 floating-point arithmetic was intended to increase the domain
of problems for which good results could be obtained, and to increase the
likelihood of getting error indications if good results could not be obtained,
all without significantly degrading performance on "common" cases and
without significantly increasing total system cost relative to sloppy
arithmetic.

Since none of these goals are very quantitative, 
people will argue about how well they've been achieved.

Part of the problem is that the benchmark programs in use measure only
common cases.  Various versions of the Linpack benchmark all have in common
that the data is taken from a uniform random distribution, producing problems
of very good condition.   So the worst possible linear
equation solver algorithm running on the dirtiest possible floating-point
hardware should be able to produce a reasonably small residual, even for
input problems of very large dimension.

Such problems may actually correspond to some realistic applications, but many
other realistic applications
tend at times to get data that is much closer to the boundaries of
reliable performance.   This means that the same algorithms and input data
will fail on some computer systems and not on others.   In consequence
benchmark suites that are developed by consortia of manufacturers 
proceeding largely by consensus, such as SPEC, will tend to exclude
applications, however realistic, for which results of comparable quality
can't be obtained by all the members' systems.

The user's own applications are always the best benchmarks, but because of
practical difficulties it makes more sense for consortia of users in various
industries to specify in a general way the problems to be solved in terms
of input data, sample algorithms, and a measure of correctness of the computed
results, and then allow fairly radical recoding of the algorithms to solve
the problems on particular architectures.  This is more like the way the 
PERFECT club works than like SPEC.   The most important thing is that the
published performance and correctness results should be accompanied by the
source code (and Makefiles etc.) that achieves them so that prospective
purchasers can determine for themselves the relevance of the results to their
particular requirements.  Most users place a great premium on NOT rewriting
any source code, but others are willing to do whatever is necessary to get
their jobs done.

The rules for the 1000x1000 Linpack benchmark, for instance, are that you
have to use the provided program for generating the input data and testing
the results, but you get to recode the linear equation solution itself,
the part that gets timed, in any way that makes sense for a particular
architecture.

===============================================================================
 From postnews Thu Jun 13 06:23:01 1991
Subject: logarithmic number representation
Newsgroups: comp.arch

Several people have commented that IBM is now experimenting with logarithmic
representations for floating-point numbers.   This is not new; people have
been doing that for a long time.  The proceedings of any IEEE Symposium on
Computer Arithmetic will have several papers on this subject.  The next
symposium is later this month in France.

Nobody has found any worthwhile advantage for logarithmic numbers for general-
purpose computation, but I have heard that they have been used in some
special-purpose embedded systems.
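
To make the idea concrete, here is a minimal sketch in C of a sign/logarithm
representation (my illustration only, not IBM's format or anyone else's): a
value is a sign plus a fixed-point approximation to the log2 of its
magnitude, so multiplication and division collapse to integer addition and
subtraction, while addition and zero are the awkward parts.

/* Sketch of a sign/logarithm number: value = sign * 2**(lg/SCALE).
   Illustration only.  Note that zero has no finite logarithm and would
   need a special encoding, omitted here. */
#include <math.h>
#include <stdio.h>

#define SCALE 65536.0                /* fixed-point scaling of the log */

typedef struct { int sign; long lg; } lns;

lns lns_from_double(double x)        /* assumes x != 0 */
{
    lns r;
    r.sign = (x < 0.0) ? -1 : 1;
    r.lg = (long)floor(log2(fabs(x)) * SCALE + 0.5);
    return r;
}

double lns_to_double(lns a)
{
    return a.sign * exp2(a.lg / SCALE);
}

lns lns_mul(lns a, lns b)            /* multiplication is an integer add */
{
    lns r = { a.sign * b.sign, a.lg + b.lg };
    return r;
}

int main(void)
{
    lns a = lns_from_double(3.0), b = lns_from_double(7.0);
    printf("3 * 7 ~= %g\n", lns_to_double(lns_mul(a, b)));
    return 0;
}

The lack of a fixed number of significand digits, and the special handling
needed for zero, are part of what makes such a format hard to describe in
conventional terms.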

X3J11 adopted a conventional floating-point number representation as part
of its standard for ANSI-C.  I pointed out that there was no way that
<float.h> could be meaningfully defined on a system with logarithmic numbers,
and the committee's response was basically that that was OK; those systems
just wouldn't support ANSI-C.

===============================================================================
 From postnews Thu Jun 13 06:50:12 1991
Subject: Re: IEEE arithmetic
Newsgroups: comp.arch
References: <9106120131.AA20868@ucbvax.Berkeley.EDU>

Responding to my earlier comments,
in article <9106120131.AA20868@ucbvax.Berkeley.EDU>, jbs@WATSON.IBM.COM writes:
> 
>> Various versions of the Linpack benchmark all have in common
>> that the data is taken from a uniform random distribution, producing problems
>> of very good condition.   So the worst possible linear
>> equation solver algorithm running on the dirtiest possible floating-point
>> hardware should be able to produce a reasonably small residual, even for
>> input problems of very large dimension.
> 
>          This is untrue.  Consider pivoting on the column element of
> smallest magnitude while using IEEE single precision.  I don't believe
> n has to be very large before you are in big trouble.

I should have said "the worst PROBABLE algorithm" - no pivoting at all rather
than aggressively choosing the worst possible pivot.  It ought to be possible to
try this out by disabling the pivot selection in the Linpack benchmark.

>> This means that the same algorithms and input data
>> will fail on some computer systems and not on others.
> 
>          If you know of a single realistic application where the diff-
> erence between 64 bit IEEE rounded to nearest and 64 bit IEEE chopped
> to zero makes a significant difference in the maximum performance ob-
> tainable I would like to see it.

I doubt you'll find anything like this in normal scientific computation
- between floating-point systems with identical exponent and significand
fields, and with rounding schemes of comparable cost and care, the
difference in usable problem domain is rarely very noticeable.  I would
expect to notice the difference in problems like
orbit computations, however - rounding should yield acceptable results
for longer simulated times than chopping, even if everything else is
equal, simply because the statistics of rounding are more favorable.  I
have also heard that the details of rounding are important in financial
calculations of bond yields.

More to the point of benchmarking defined by vendor consortia, the
differences in applicable domain between 32-bit IBM 370 format and
32-bit IEEE, or between 64-bit Cray format and 64-bit IEEE, or even
between HP and TI hand calculators with 10 displayed digits, have been
part of the experience of a number of comp.arch readers.   In the first
two cases you can often get by with twice as much of the dirtier
precision.  In the Cray case that's much slower, however, and may
eliminate most of the supercomputer's advantage over a workstation.
The IBM 370 case may be slower too if the problem involves much more
than +-*.

===============================================================================
 From postnews Thu Jun 13 09:31:50 1991
Subject: Linpack benchmark condition; antipivoting stability
Newsgroups: comp.arch
References: <9106120131.AA20868@ucbvax.Berkeley.EDU> <394@validgh.com>

James Shearer caught a mistake in my previous posting about the Linpack
benchmark.  The point of my posting was that the benchmark input data
was taken from a uniform random distribution, and so the problem was very
well conditioned, and practically any algorithm - and any kind of arithmetic - 
would produce satisfactory results.  

Shearer noticed that I had overstated the case somewhat, since an algorithm
that intentionally sought minimal rather than maximal pivot divisors was apt
to do rather poorly, so I had to correct my statement to practically any
algorithm that was probably going to be implemented.  I was curious to see
how the details worked out in practice.

I modified ISAMAX in the standard Linpack benchmark code to optionally
choose the minimum-magnitude pivot rather than the normal maximum-magnitude,
or to choose the first candidate pivot, which is equivalent to doing no
pivoting.  A 100x100 single-precision or double-precision
matrix was enough to confirm that the
minimum-magnitude algorithm produces garbage results.
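
In outline the modification amounts to something like the following (a C
paraphrase for readers who don't have the Fortran handy; the numbers in the
table below came from the modified Fortran benchmark, not from this sketch):

/* Gaussian elimination with a selectable pivot strategy: largest-magnitude
   pivot (normal partial pivoting), smallest-magnitude pivot, or no pivoting
   at all.  A is n x n stored by rows in a[i*n+j]; on return b holds x. */
#include <math.h>

enum pivot_strategy { PIVOT_MAX, PIVOT_MIN, PIVOT_NONE };

void gauss_solve(int n, double *a, double *b, enum pivot_strategy strat)
{
    for (int k = 0; k < n; k++) {
        int p = k;                                /* pivot row */
        if (strat != PIVOT_NONE) {
            for (int i = k + 1; i < n; i++) {
                double cand = fabs(a[i*n + k]);
                double best = fabs(a[p*n + k]);
                if ((strat == PIVOT_MAX) ? (cand > best) : (cand < best))
                    p = i;
            }
        }
        if (p != k) {                             /* swap rows k and p */
            for (int j = 0; j < n; j++) {
                double t = a[k*n + j]; a[k*n + j] = a[p*n + j]; a[p*n + j] = t;
            }
            double t = b[k]; b[k] = b[p]; b[p] = t;
        }
        for (int i = k + 1; i < n; i++) {         /* eliminate below the pivot */
            double m = a[i*n + k] / a[k*n + k];
            for (int j = k + 1; j < n; j++)
                a[i*n + j] -= m * a[k*n + j];
            b[i] -= m * b[k];
        }
    }
    for (int k = n - 1; k >= 0; k--) {            /* back substitution */
        for (int j = k + 1; j < n; j++)
            b[k] -= a[k*n + j] * b[j];
        b[k] /= a[k*n + k];
    }
}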

However the NOPIVOT option didn't do badly and produced results that might
be acceptable for some uses:

dimension  precision  method    minimum    normalized    error in
                                pivot      residual      x - 1

  100      single     normal    2          1.6           1e-5
  100      single     nopivot   .13        70            3e-4
 1000      single     normal    2          11            2e-4
 1000      single     nopivot   7e-2       2600          8e-2
 1000      double     normal    2          10            5e-13
 1000      double     nopivot   6e-2       3900          6e-11

The Linpack benchmark is intended to solve an equation A x = b
where the elements of A are chosen from a uniform random distribution
and b is computed so that the correct solution x is close to a vector
of 1's.  So looking at the size of the components of x - 1 is supposed to
indicate how accurate the solution is, although
the residual b - A x is a more reliable measure of quality.

The foregoing results came from a SPARCstation.  The bottom line is
that even with no pivoting, Gaussian elimination 
on a 1000x1000 matrix of uniform random numbers will lose only 2 or 3
digits in the result compared to partial pivoting.   Any consortium-
derived benchmark suite that used such a program probably wouldn't
even notice - for instance,
SPEC supplies special tools for validating results that ignore
the least significant digits.

===============================================================================
 From postnews Thu Jun 13 14:08:18 1991
Subject: More on Linpack pivoting: isamax and instruction set design
Newsgroups: comp.arch

Referring to isamax in my previous Linpack posting brought to mind
the quandary that it represents for computer instruction set architects.

I believe that Cray first noticed that when LU-factoring matrices of dimension
n, although saxpy operations
(X += c * Y for scalar c and vectors X and Y) are n times more frequent
than isamax operations (find i such that Xi has the largest magnitude in X),
you can gain a lot more from aggressive vector hardware design and compiler
technology on saxpy than on isamax, so that eventually isamax becomes
the bottleneck.  The heart of isamax is

c     scan for the index of the element of largest magnitude;
c     isamax and dmax hold the index and magnitude of the leader so far
      do 30 i = 2,n
         if(abs(dx(i)).le.dmax) go to 30
c        new leader: remember its index and magnitude
         isamax = i
         dmax = abs(dx(i))
   30 continue

What kinds of additional
instructions belong in an instruction set architecture that is
intended to be scalable, from one-chip systems with inexpensive memory,
to very high performance systems implemented with multiple paths to memory
and perhaps multiple functional units (integer, float, branch)
that may, however, be relatively distant, so that conditional
branches become quite expensive?
A simple max/min won't help.   How can multiple comparisons be overlapped?
Can the "abs" be concealed?
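
One purely software answer, as a point of comparison for what an instruction
set might provide: the abs can be concealed by masking off the IEEE sign bit
and comparing magnitudes as integers, and several partial maxima can be
carried so that the compares in different lanes are independent.  A sketch
(mine, not a proposal; ties may resolve differently than the Fortran, and
the returned index is 0-based):

/* Branch behavior of isamax restructured: |dx(i)| comes from clearing the
   sign bit of the IEEE double, and magnitudes are compared as integers,
   which orders finite IEEE values correctly; four running maxima keep the
   comparisons in separate lanes independent of one another. */
#include <stdint.h>
#include <string.h>

int isamax_sketch(int n, const double *dx)
{
    uint64_t best[4] = { 0, 0, 0, 0 };
    int besti[4] = { 0, 0, 0, 0 };

    for (int i = 0; i < n; i++) {
        uint64_t bits;
        memcpy(&bits, &dx[i], sizeof bits);    /* reinterpret the double */
        bits &= ~(1ULL << 63);                 /* clear sign bit: |dx[i]| */
        int lane = i & 3;
        if (bits > best[lane]) {
            best[lane] = bits;
            besti[lane] = i;
        }
    }
    int winner = 0;                            /* reduce the four lanes */
    for (int lane = 1; lane < 4; lane++)
        if (best[lane] > best[winner]) winner = lane;
    return besti[winner];
}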

===============================================================================
 From postnews Tue Jun 18 08:48:18 1991
Followup-To: poster
Subject: Condition number of uniform random matrices; residuals
Newsgroups: sci.math.num-analysis,comp.arch

jbs@ibm.com asserts, about solving linear equations with
matrices composed of elements from a uniform random distribution:

> Problems
> of size 1000 appear to typically have condition number between 10**5
> and 10**6.

I haven't observed any evidence to support this.  If uniform random 1000x1000
matrices had such condition numbers then it would be impossible to solve
them accurately with Gaussian elimination.  I did report that 
Gaussian elimination without pivoting was somewhat unstable on matrices
of that size, but that only reflects on the stability of that method,
rather than the underlying condition of the problem, independent of the
method.

For instance, the normalized residual computed in the Linpack program,

        RESIDn = RESID/( N*NORMA*NORMX*EPS )

grows very slowly; typically 1.5-2 for 100x100 systems, and around 10
for 1000x1000 systems, using IEEE arithmetic and conventional partial
pivoting.  So the growth of RESID itself is only slightly faster than linear 
with N.  Of course the size of the residual b-Ax doesn't tell you too much 
about the size of the error in x, in general.   Luck is with us in this case,
however: for single-precision matrices, the measured difference between the
computed x and the nominally correct x whose elements are all exactly one
grows in about the same ratio, e.g. about 1e-4 for 1000x1000 vs. 1e-5 for
100x100.  Of course the exactly correct answer for this problem isn't quite
a vector of 1's: the right-hand side b was computed by rounding the result
of A*(1 1 1 ...), so the b actually used differs slightly from the one
corresponding to a vector of 1's, and the x that exactly solves the system
with that b is correspondingly somewhat different - but fortunately none of
this matters much for such a well-conditioned problem.
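
For reference, the check being described is roughly the following (a C
paraphrase of the Fortran driver, with EPS taken as the double-precision
machine epsilon; RESID, NORMA and NORMX follow the Linpack names):

/* Normalized residual as in the Linpack driver:
   residn = max|b - A x| / (n * max|a(i,j)| * max|x(i)| * eps). */
#include <float.h>
#include <math.h>

double residn(int n, const double *a, const double *x, const double *b)
{
    double resid = 0.0, norma = 0.0, normx = 0.0;
    for (int i = 0; i < n; i++) {
        double ri = b[i];
        for (int j = 0; j < n; j++) {
            ri -= a[i*n + j] * x[j];
            if (fabs(a[i*n + j]) > norma) norma = fabs(a[i*n + j]);
        }
        if (fabs(ri) > resid) resid = fabs(ri);
        if (fabs(x[i]) > normx) normx = fabs(x[i]);
    }
    return resid / (n * norma * normx * DBL_EPSILON);
}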

But this is a good illustration of a point I made previously: in general
you don't know the error, because that would be tantamount to knowing the
answer, and most people don't spend most of their time solving problems
with known answers.  So the normalized residual is a good indication of
whether an algorithm has solved a linear equations problem about as well as
can be expected in a particular floating-point precision.  If an answer
with a small error bound is desired,
regardless of the condition of the problem, something else will have to be
done.

This reminds me of a problem that was posed to me several days in advance
of my qualifying examination:

A very eminent applied mathematician and one of his students
had submitted a paper to a prestigious
journal showing experimentally that all the theory about iterative 
improvement must be wrong.

(Iterative improvement is a method for improving the accuracy of the answer
x for linear equation problems that are moderately ill-conditioned.
When the residual can be computed in higher precision, the accuracy of x
can often be improved to nearly full working precision. )
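
In outline, a few sweeps of iterative improvement look something like this
sketch in C (working precision double, residual accumulated in long double
where that is actually wider; "solve" is a placeholder parameter for
whatever routine reuses the existing factorization of A, not a particular
library routine):

/* Iterative improvement: the residual b - A x is accumulated in higher
   precision, the correction d solving A d = r is obtained from the
   caller-supplied "solve", and d is added to x. */
void iterative_improve(int n, const double *a, const double *b, double *x,
                       void (*solve)(int n, double *rhs_and_solution),
                       int sweeps)
{
    for (int s = 0; s < sweeps; s++) {
        double d[n];                       /* C99 variable-length array */
        for (int i = 0; i < n; i++) {
            long double r = b[i];          /* residual in extra precision */
            for (int j = 0; j < n; j++)
                r -= (long double)a[i*n + j] * x[j];
            d[i] = (double)r;
        }
        solve(n, d);                       /* d now solves A d = r */
        for (int i = 0; i < n; i++)
            x[i] += d[i];                  /* correct the solution */
    }
}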

The authors had computed solutions to problems from
several collections of test matrices of varying conditions
and to a number of random matrices and had seldom seen iterative improvement
yield more than a digit of improvement.

What was really going on here?  I couldn't figure it out before the
qualifying exam, but for better or worse I passed anyway.

===============================================================================
 From postnews Tue Jun 18 09:11:59 1991
Subject: Re: IEEE arithmetic
Newsgroups: comp.arch
References: <9106162326.AA03555@grape.ecs.clarkson.edu>

jbs@ibm.com says:

> I believe IEEE chopped will be easier to implement and perform
> better than IEEE rounded. 

This is true, but only marginally.  The most important cost in IEEE (or VAX)
floating-point arithmetic is getting the correct unrounded result,
or at least enough of it to permit correct rounding.  This requires
provision of guard and (for chopping or unbiased rounding) sticky bits, 
postnormalization, 
development of full doubled-precision products,
careful division and sqrt, etc.

Once you've gotten the unrounded answer, rounding (as opposed to chopping) 
costs an extra add time and dealing with the possibility of a carry out
of the most significant bit.  Although these are real design costs, I doubt
they've affected performance of modern pipelined IEEE designs compared to
the cost of getting the unrounded result.
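
In terms of a made-up intermediate format, the difference looks something
like this sketch (guard and sticky bits assumed already in hand from the
exact operation; the toy significand is P bits with the leading 1 in bit
P-1):

/* Round-to-nearest-even versus chopping, given the unrounded result.
   Rounding needs the increment and must watch for a carry out of the most
   significant bit, which forces a renormalization; chopping just discards
   the guard and sticky bits. */
#include <stdint.h>

#define P 24                              /* toy significand width */

typedef struct { uint32_t frac; int exp; } toyfp;

toyfp round_nearest_even(uint32_t frac, int guard, int sticky, int exp)
{
    int lsb = frac & 1;
    if (guard && (sticky || lsb))         /* beyond halfway, or tie to even */
        frac += 1;
    if (frac >> P) {                      /* carry out of the top bit */
        frac >>= 1;                       /* renormalize */
        exp += 1;
    }
    toyfp r = { frac, exp };
    return r;
}

toyfp chop(uint32_t frac, int guard, int sticky, int exp)
{
    (void)guard; (void)sticky;            /* simply dropped */
    toyfp r = { frac, exp };
    return r;
}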

I suspect that the reason IBM 370 arithmetic isn't correctly chopped is 
that the cost of doing it correctly - maintaining a sticky bit for
subtraction - was deemed too high, and the computed result is more accurate
in those cases where rounding is done instead of chopping.  More accurate,
but not in a useful way - the error bounds you're likely to derive will
still have to reflect chopping, while any algorithm that would like to
depend on correct-chopping principles (such as (x-z) < x when x and z are
positive) will have to do without.

In general, any process that involves repeated inexact conversions,
particularly binary->decimal->binary->decimal, or addition of many small
inexact quantities, will benefit from the better statistics of rounding
over chopping - drifting tendencies will be reduced, and error bounds
will be smaller.
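
Where C99 directed rounding is available (it was not, of course, on most of
the machines under discussion), the effect is easy to see by accumulating
the same small, inexact increment under round-to-nearest and under chopping;
a sketch, assuming <fenv.h> honors fesetround:

/* Accumulate many copies of an increment that is inexact in binary, once
   under round-to-nearest and once under round-toward-zero (chopping), and
   compare the drift against the exact sum. */
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

float accumulate(int mode, int n, float increment)
{
    float sum = 0.0f;
    fesetround(mode);
    for (int i = 0; i < n; i++)
        sum += increment;                 /* each addition is rounded */
    fesetround(FE_TONEAREST);
    return sum;
}

int main(void)
{
    int n = 1000000;
    float inc = 0.1f;                     /* 0.1 is not exact in binary */
    printf("round to nearest: %.2f\n", accumulate(FE_TONEAREST, n, inc));
    printf("chop toward zero: %.2f\n", accumulate(FE_TOWARDZERO, n, inc));
    printf("exact sum:        %.2f\n", (double)n * 0.1);
    return 0;
}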

For instance, during the 1970's one of the minor regional stock exchanges
noticed that its market stock average was declining in a bull market in which
most of the component stocks were rising in value.   There were a number
of factors at work in this phenomenon; using chopping arithmetic was
one of them.

Using enough precision can sometimes overcome deficiencies of chopping or
dirty arithmetic; otherwise Seymour Cray would have done things differently.
But it isn't always enough, as most Cray customers find out occasionally. 

===============================================================================
 From postnews Tue Jun 18 09:30:04 1991
Subject: Implementing Interval Arithmetic with IEEE rounding modes
Newsgroups: comp.arch

Dik Winter observes that dynamic rounding modes (that can be varied
at run time) don't have to be quite so expensive on pipelined systems
if the mode controls can be written without emptying the pipeline.

If you don't have such a feature, then the cost of interval addition
is dominated by the cost of emptying the pipe, and the cost of the two
directed-rounding additions themselves is minor.

However several existing instruction set architectures, including SPARC for
which I can take some blame, combine too much stuff into the floating-point
control/status register, so that it's unrealistic to expect
the hardware to keep track of changes to the rounding modes only... so
the pipeline will necessarily have to be drained.  In addition the SPARC
architecture specifies that changes to the floating-point control/status
register don't have to take effect for a couple of cycles after the load,
with no hardware interlock support,
so you have to force some nop dead time before you can initiate a rounded add.
Other RISC architectures may be similar.  The common CISC architectures
have separate control and status registers, and at least the microcoded
implementations of floating-point +-*/ are so slow that control register
manipulations are of no consequence.   This was the environment anticipated
when IEEE 754 was being drafted, leading to its expectation (though not a
requirement) that rounding modes be implemented dynamically in hardware.
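
In C99 terms the operation under discussion looks like the sketch below
(assuming an <fenv.h> that honors fesetround; the expense being argued about
is precisely the rounding-mode changes wrapped around the two adds):

/* Interval addition via dynamic rounding modes: the lower bound is computed
   rounding toward -infinity and the upper bound rounding toward +infinity,
   so the true sum is always contained in [lo, hi]. */
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

typedef struct { double lo, hi; } interval;

interval interval_add(interval a, interval b)
{
    interval r;
    int saved = fegetround();

    fesetround(FE_DOWNWARD);
    r.lo = a.lo + b.lo;

    fesetround(FE_UPWARD);
    r.hi = a.hi + b.hi;

    fesetround(saved);
    return r;
}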

So for satisfactory interval arithmetic performance, I've come to the
conclusion that one of the following paths must be followed:

1) separate control and status registers and appropriate hardware support

2) additional instructions with static rounding modes built in -
	+-*/ rounding to +infinity
	+-*/ rounding to -infinity

3) microprogrammed instructions, perhaps in a coprocessor, to perform
	interval operations directly 



