advantages of fused multiply-add, and a thesis project

David Hough sun!Eng!David.Hough
Mon Jun 17 16:56:19 PDT 1991


Recalling the previous discussion of fused multiply-add, which computes
(a*b)+c correctly rounded (i.e. with only one rounding error), I'd like
to ask whether anybody on this list knows of situations in which the
presence of fused multiply-add (IBM RS/6000 style) provides a significant
accuracy advantage or a significant performance advantage relative to a
chained multiply-add (PA-RISC style), which dispatches both operations in
one instruction but performs two roundings, all other factors being equal.

Note that -

1) fused multiply-add with one rounding error
always provides a lower error bound,
relative to doing two operations with one rounding error each.

2) fused multiply-add with one instruction dispatch
often provides a performance advantage, relative
to dispatching two separate instructions.

So the interesting questions must be relative to a chained multiply-add
that dispatches two instructions in the same cycle.

How can fused multiply-add offer better performance than chained multiply-add?
Only when the lower error bound allows the desired quantity to be computed
with fewer instructions, or allows an iterative process to terminate after
fewer iterations.  

An example of computing with fewer instructions is computing doubled-precision
products of working precision data:

	p1 = x * y
	p2 = x * y - p1

Here p1+p2 == x*y exactly with fused multiply-add, barring exceptions.  
To get the same result from chained multiply-add requires many more steps:
dividing x and y into upper and lower pieces that can be multiplied
together without rounding error, doing the multiplications and adding
up the partial products.   Even better, of course, is to implement a doubled-
precision product in hardware.
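
For concreteness, here is one way the two approaches might look in C,
assuming an fma() function or compiler intrinsic that maps to the hardware
fused multiply-add (a sketch only, not anybody's production code; the
splitting constant assumes IEEE double):

	#include <math.h>

	/* With fused multiply-add: two operations.  The fma rounds only
	   once, so p2 is exactly the rounding error of the first multiply
	   and p1 + p2 == x*y, barring over/underflow.                     */
	void fused_product(double x, double y, double *p1, double *p2)
	{
	    *p1 = x * y;
	    *p2 = fma(x, y, -*p1);
	}

	/* Without it: split each operand into halves that multiply exactly
	   (Veltkamp/Dekker-style splitting for IEEE double), then add up
	   the partial products -- many more operations for the same p1, p2. */
	void split_product(double x, double y, double *p1, double *p2)
	{
	    const double c = 134217729.0;            /* 2^27 + 1 */
	    double xb = c * x, yb = c * y;
	    double xh = xb - (xb - x), xl = x - xh;  /* x == xh + xl */
	    double yh = yb - (yb - y), yl = y - yh;  /* y == yh + yl */

	    *p1 = x * y;
	    *p2 = ((xh * yh - *p1) + xh * yl + xl * yh) + xl * yl;
	}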

An example of stopping an iterative process sooner is performing division 
or sqrt by Newton iterations; the last step involves forming a product and
adding it to the previous iterate.   The fused multiply-add allows one
fewer iteration on IBM RS/6000 than chained multiply-add would. 
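
For illustration, the general shape of such a division iteration in C might
look like the following (a sketch only, not the actual RS/6000 code;
approx_recip is a placeholder for whatever initial reciprocal estimate a
table or the hardware provides, and fma() again stands for the fused
multiply-add):

	#include <math.h>

	double approx_recip(double b);  /* placeholder estimate of 1/b */

	/* Sketch of Newton-Raphson division a/b.  Each refinement and the
	   final correction is a single fused multiply-add; in the last
	   step the residual a - b*q is formed with only one rounding,
	   which is what lets the iteration stop a step sooner than it
	   could with two roundings.                                      */
	double newton_divide(double a, double b)
	{
	    double r = approx_recip(b);       /* crude reciprocal estimate */
	    double e, q;

	    /* Two refinement steps shown; the number actually needed
	       depends on the accuracy of the initial estimate.           */
	    e = fma(-b, r, 1.0);              /* e = 1 - b*r               */
	    r = fma(r, e, r);                 /* r = r + r*e               */
	    e = fma(-b, r, 1.0);
	    r = fma(r, e, r);

	    q = a * r;                        /* provisional quotient      */
	    return fma(fma(-b, q, a), r, q);  /* q + (a - b*q)*r           */
	}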

While both these examples are interesting, they may not convince people
who have already decided to implement higher precision in hardware or who
implement division and sqrt in hardware.  Even the doubled-precision product
of working-precision operands is only one piece of a doubled-precision product
of doubled-precision operands, and fused multiply-add doesn't help much
with getting the doubled-precision sum of doubled-precision operands.
(The latter conceptually requires sum of three operands computed with a single
rounding error, or if you prefer, an instruction which yields the "lower half"
of a sum:

	p1 = x + y
	p2 = lower(x+y)

so p1 + p2 == x + y.  Another instruction 
lower(x*y) could be used to provide a doubled-precision product.)
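
For reference, the lower half of a sum can already be recovered in software
by the well-known six-addition sequence below (in round-to-nearest); note
that it contains no multiplication at all, which is why fused multiply-add
is of no help here.  A sketch in C:

	/* Software equivalent of the hypothetical lower(x+y): the classic
	   six-addition two-sum.  s + e == x + y exactly in round-to-
	   nearest, barring overflow.  No multiply appears, so fused
	   multiply-add cannot speed it up.                                */
	void two_sum(double x, double y, double *s, double *e)
	{
	    double t;
	    *s = x + y;
	    t  = *s - x;
	    *e = (x - (*s - t)) + (y - t);
	}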

Portable benchmark programs that have survived IBM 370, Cray, VAX, 
conventional IEEE, and IEEE extended precision register allocation 
continue to run about as accurately with fused multiply-add.  So I wouldn't
expect much useful information to arise from examining these.   Better accuracy
would be demonstrated mostly on programs that were written 
intentionally to exploit a fused multiply-add if available.

In the IBM RS/6000 system implementation, fused multiply-add is credited with
making the elementary transcendental function implementations faster or more
accurate than they would be otherwise.  This is difficult to assess from
outside: without source code it's hard to tell what the accuracy impact
of a chained multiply-add would be relative to a fused one, or what the
performance impact of a chained multiply-add would be assuming the accuracy
target were fixed.  Again, performance comparisons to unchained multiply-add
or accuracy comparisons to elementary transcendental function libraries with
different accuracy goals don't shed any light.

I have heard that the LAPACK programs exploit fused multiply-add on RS/6000
systems in order to compute residuals faster, but I don't know enough about
these programs to know whether the performance impact is really significant,
overall, compared to what would have been obtained with only chained 
multiply-add.
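
For what it's worth, the kind of kernel involved looks roughly like the
following (an illustration only, not the actual LAPACK code); with a fused
or chained multiply-add each term of the residual costs one dispatch, and
the fused form additionally rounds each product only once:

	#include <math.h>

	/* Residual r = b - A*x for an n-by-n matrix stored by rows.
	   Illustration only; each term is one fused multiply-add.        */
	void residual(int n, const double *a, const double *x,
	              const double *b, double *r)
	{
	    for (int i = 0; i < n; i++) {
	        double s = b[i];
	        for (int j = 0; j < n; j++)
	            s = fma(-a[i*n + j], x[j], s);  /* s -= a[i][j]*x[j] */
	        r[i] = s;
	    }
	}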

So any other definite information, especially from end users, would be welcome.
I can't recall seeing any postings of the flavor "my program ran on RS/6000
but not on PA-RISC due to multiply-add" on comp.arch or comp.lang.fortran.
The persons most concerned seem to be some IBM sales people, attempting to sell
fused multiply-add as an advantage, and some non-IBM sales people, attempting to
sell fused multiply-add as a disadvantage.  Has anybody ever been involved
in a situation where presence or absence of multiply-add was a deciding 
factor? 

------------------------------------------------------------------------

THESIS PROJECT:  A timely MS (at least) thesis project would be to

1) Implement 

	float powf(float, float)

a single-precision implementation of the power function,
on some modern IEEE 754 RISC processor.
Requirements are:

	correctly rounded result according to the current IEEE rounding mode
	correctly reported exceptions including inexact
	use only IEEE single (float) and double, and 32-bit int and unsigned
	AS FAST AS POSSIBLE
	as portable as possible

The goal is to make this program behave as much as possible like an atomic
machine instruction.
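(A naive, non-conforming starting point is sketched after the explanations
below.)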

2) Implement a powf correctness test program that tests for correct rounding and
exceptions, using a different algorithm, on arguments chosen specifically to
test correct rounding and exception handling as well as randomly.

3) Implement a powf performance test program that generates realistic
performance comparisons of different implementations and is immune to simple-
minded compiler optimization tricks.

4) Determine changes in instruction set architecture, reasonable in a RISC
paradigm, and in programming languages, necessary to improve performance
and portability of this kind of program.  How much does fused multiply-add
help?  How much does double precision help?

5) Compare to almost-correctly rounded algorithms like Tang's.

[Explanations: single precision rather than double, so that higher-precision
intermediates can be investigated; pow() rather than exp() or log(), for which
the need for inexact exceptions is arguable.]
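
As a naive starting point for item 1 (a sketch, assuming a good-quality
double-precision pow() in the C library), one might simply evaluate in
double and round once to float.  This respects the restriction to float and
double but meets none of the correctness requirements, which is precisely
what makes the project interesting:

	#include <math.h>

	/* Naive powf: evaluate in double, round once to float.  Assuming
	   a good-quality double pow, the result is correctly rounded for
	   most but not all float arguments (double rounding can spoil
	   cases that land very near a float rounding boundary), and the
	   exception and rounding-mode requirements above are not
	   addressed at all.                                               */
	float powf_naive(float x, float y)
	{
	    return (float)pow((double)x, (double)y);
	}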


