No subject

Russell L. Carter rcarter at consys.com
Fri Feb 21 12:59:19 PST 1997


> 
> 	From: "Russell L. Carter" <rcarter at consys.com>
> 
> 	Hmm, I've run a lot more x86 code through LAPACK than just the
> 	test and timing problems and have never had a problem.  This is
> 	using various versions of gcc on various PC unices.  I did have
> 	a problem with some of the eigenvalue test routines, but since
> 	I didn't need them, I ignored them.
> 
> This is anecdotal evidence.  That a package works for one happy user
> (or mostly happy in this case) is little consolation to users for whom
> it doesn't.  The objective way to test a package is to torture it on
> an official rack, not on one's own software.


David supplied anecdotal evidence; so did I.  I suppose each conclusion
has equal value.


> 
> 	Should 95% of the people on the planet either have their
> 	current Java machine run slower by at least a factor of 2 or
> 	replace their current machine with Sun approved hardware in
> 	order to make sure that broken programs do not arouse suspicion
> 	by running (slightly) differently on different available
> 	platforms?...  Those who know x86 performance are probably
> 	wondering why I only quote a factor of 2.  You're right, it's
> 	likely a bigger hit than that.
> 
> Apparently the performance impact of suboptimal floating point is
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^
> highly variable.  Intel arrived at their celebrated 27,000 year period
> for encountering the Pentium division bug by estimating that in typical
> use a spreadsheet would perform 1,000 divisions a day.  With that
> frequency, or even a much higher one, one would not see a noticeable
> slowdown with MathWorks' division bug workaround installed to eliminate
> the problem.

Suboptimal floating point?  The Pentium fdiv bug was suboptimal, the
various flavors of Cray arithmetic were surely suboptimal, and the
original IBM 360's subtract-magnitude arithmetic, with no guard digit,
was suboptimal; but it's really hard to see how producing more accurate
floating point results, faster, is suboptimal.

In most practical codes, divisions are a rarity, *especially* in high
performance floating point codes.  Most CPUs have terrible divide
performance.  Ordinarily, additions and multiplications vastly
outnumber divisions.  Studies that count the number of multiplications
and divisions in interesting codes are available, courtesy of the Cray
hardware performance monitor.
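
To put numbers behind that, here is a small C sketch, entirely my own
illustration rather than one of the Cray studies, that counts the
operations in a plain unpivoted LU factorization: for an n by n matrix
the divides grow as n^2/2 while the multiplies grow as n^3/3, so at
n = 100 the multiplies already outnumber the divides by better than
60 to 1.

    #include <stdio.h>

    #define N 100

    /* Count multiplies and divides in an unpivoted LU factorization
       of an N x N matrix.  Illustration only: divides grow as N^2/2,
       multiplies as N^3/3. */
    int main(void)
    {
        static double a[N][N];
        long muls = 0, divs = 0;
        int i, j, k;

        /* Diagonally dominant fill, so no pivot is ever zero. */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                a[i][j] = (i == j) ? N : 1.0 / (i + j + 1);

        for (k = 0; k < N - 1; k++) {
            for (i = k + 1; i < N; i++) {
                a[i][k] /= a[k][k];                /* one divide per row */
                divs++;
                for (j = k + 1; j < N; j++) {
                    a[i][j] -= a[i][k] * a[k][j];  /* one multiply per entry */
                    muls++;
                }
            }
        }
        printf("divides: %ld  multiplies: %ld  ratio: %.1f\n",
               divs, muls, (double)muls / divs);
        return 0;
    }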

Now, what does the "workaround" really mean?  Coonen's report describes
the problem, but it appears many do not understand the crux.  Both the
P5 and P6 can be programmed to approach perhaps 2/3 of the peak
performance of 1 FLOP/cycle.  This is done through careful management
of the 8-register stack, so that the 8-stage floating point pipeline
is kept busy.  The stack can be manipulated with no delay, due to
instruction pairing.  As long as there are no "extra" loads or stores,
the pipeline can be kept mostly filled.  If after each floating point
operation the result must be written back to and read from memory,
the pipeline drains and there is a significant stall before the next
floating point operation completes.  I have written a number of versions
of blocked DGEMM in x86 assembly code, and I would guess a factor of 10,
maybe higher, performance loss from writing and reading after each flop.
None of this addresses locality of reference or main memory bandwidth
in the system, which is why I suggest a factor of 2.
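
To make the store/reload traffic concrete, here is a sketch in C of the
idiom in question; the volatile temporary is my own illustration of
forcing each intermediate through a 64-bit memory slot so it is rounded
to double, not anything prescribed by Coonen's report.

    /* Dot product two ways.  The fast path keeps the running sum on
       the x87 stack, where it may carry 80-bit extended precision.
       The "workaround" path forces every intermediate through a
       64-bit memory temporary, rounding it to double at each step. */
    double dot_fast(const double *x, const double *y, int n)
    {
        double s = 0.0;           /* stays in a stack register */
        int i;
        for (i = 0; i < n; i++)
            s += x[i] * y[i];
        return s;
    }

    double dot_strict(const double *x, const double *y, int n)
    {
        volatile double t;        /* forces a store and a reload */
        double s = 0.0;
        int i;
        for (i = 0; i < n; i++) {
            t = x[i] * y[i];      /* round the product to double */
            t = s + t;            /* round the sum to double */
            s = t;
        }
        return s;
    }

Compiled for the x87 with optimization on, the first loop keeps s in
ST(0); the second must issue a store/load pair on every assignment to
t, which is exactly the pipeline drain described above.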

[ this just in:  I have read Sam Figueroa's note and *he* suggests
  a factor of 10 ]

> 
> Now we are being told that a workaround correcting for the vagaries of
> 80-bit extended precision will slow Java programs down by a factor of
> 2.  It is hard to believe that Java programs are *that* much more
> numerical than spreadsheet programs.

The issue is floating point in all programs, not just spreadsheets.  I would
like to know how it is proposed to do geometrical transformations or
image processing in multimedia applications without floating point.
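
To be concrete about what I mean, and this example is mine, not
anything from the thread: even the simplest 2-D rotation in a graphics
pipeline is two multiplies and an add per coordinate, exactly the kind
of floating point work multimedia applications do in bulk.

    #include <math.h>

    /* Rotate n points about the origin by `angle` radians: the staple
       floating point work of geometric transforms, all multiplies and
       adds, with the two trig calls hoisted out of the loop. */
    void rotate2d(double *xs, double *ys, int n, double angle)
    {
        double c = cos(angle), s = sin(angle);
        int i;
        for (i = 0; i < n; i++) {
            double x = xs[i], y = ys[i];
            xs[i] = c * x - s * y;
            ys[i] = s * x + c * y;
        }
    }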

> 
> Apparently people make up whichever extreme statistics support their
> case as they go along, and have no shame when the two extremes differ
> by however many orders of magnitude it takes to get from 1000 divisions
> a day for a spreadsheet to a factor-of-two slowdown for a Java applet.

:-)  

2x or 1.02x, that surely is the question...

> 
> Carter's overall argument is part technical and part polemical.  That
> the above two flaws occurred in the technical parts, which I could
> understand, greatly undermined for me the credibility of the polemical
> parts, which I could not.
> 
> Vaughan Pratt
> 

Oh well.

Cheers,
Russell



