            THE BENEFITS OF IBM'S AND HP'S MULTIPLY-ADD INSTRUCTION
                              by Samuel Figueroa
                               October 29, 1991

(The author performed the work reported here in August 1991 while employed by
Sun Technology Enterprises.  The author is now attending New York University. 
He can be reached by sending electronic mail to: figueroa@cs.nyu.edu)


SUMMARY

The goal of my project was to attempt to determine which flavor of multiply-add
instruction is better: IBM's fused version, in which the quadruple precision
result of the multiplier is fed into the adder, producing a double precision
sum with rounding performed only once for the entire operation; or HP's
version, which allows the FPU to behave as a superscalar processor in certain
cases in which the addition does not depend on the multiplication and vice
versa.

After conducting a shallow investigation, I feel that IBM's fused version of
multiply-add is the better one, though not because of the increased accuracy it
affords.  Rather, the reason is that IBM's instruction has the effect of
reducing the latency of multiplication and addition, so that these two
operations can be alternated in very quick succession.  HP's instruction, on
the other hand, is of no use unless there is enough parallelism in the code,
such as when there is a loop that can be unrolled.  Increased parallelism also
gives IBM's FPU opportunities for improved performance, but IBM's instruction
can be used in a wider variety of situations.


INTRODUCTION AND GOAL

The goal of my project was to attempt to determine which flavor of multiply-add
instruction is better: IBM's or HP's.  IBM's multiply-add instruction
multiplies two double precision operands, producing a quadruple precision
product to which a third double precision operand is added, producing a double
precision sum with rounding performed only once for the entire operation.
Thus, this instruction has 4 double precision operands.  In IBM's
implementation of this instruction, the multiplier and the adder cannot be
separated: if only multiplication is required, a 0 is fed in as one of the
inputs to the adder; if only addition is required, a 1 is fed in as one of the
inputs to the multiplier.
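
The single-rounding behavior described above is the same semantics later
standardized as the C library function fma().  The following small C program
(my own illustration, not taken from [1]) picks operands for which the
difference between one rounding and two is visible:

    #include <math.h>
    #include <stdio.h>

    /* fma(a, b, c) computes a*b + c with one rounding, as IBM's fused
       instruction does; writing a*b + c in source code rounds twice.  */
    int main(void) {
        double a = 1.0 + 0x1.0p-27;     /* 1 + 2^-27                   */
        double b = 1.0 - 0x1.0p-27;     /* 1 - 2^-27                   */
        double c = -1.0;                /* a*b = 1 - 2^-54 exactly     */

        double fused    = fma(a, b, c); /* exact product, one rounding:
                                           gives -2^-54                */
        double separate = a * b + c;    /* a*b first rounds to 1.0:
                                           gives 0                     */
        printf("fused = %g, separate = %g\n", fused, separate);
        return 0;
    }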

HP's version, on the other hand, is simply a means to keep both the multiplier
and the adder busy simultaneously.  This instruction has 5 double precision
operands: 2 source and 1 target operand for the multiplier, and 1 source and 1
source/target operand for the adder.  This instruction allows the FPU to behave
as a superscalar processor in certain cases in which the addition does not
depend on the multiplication and vice versa.
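
To illustrate the kind of independence HP's paired instruction needs, consider
a dot product (a hypothetical example of my own, written in C for readability;
on the HP machine the compiler or assembly programmer would express each
marked pair as one combined instruction):

    /* The multiply for iteration i is independent of the add that
       folds in the product from iteration i-1; this independent
       multiply/add pair is the pattern HP's 5-operand instruction
       encodes.  Assumes n >= 1.                                       */
    double dot(const double *x, const double *y, int n) {
        double sum  = 0.0;
        double prod = x[0] * y[0];      /* prime the pipeline          */
        for (int i = 1; i < n; i++) {
            double next = x[i] * y[i];  /* multiply: independent of... */
            sum += prod;                /* ...this add; one paired op  */
            prod = next;
        }
        return sum + prod;              /* fold in the last product    */
    }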

Among other things, IBM claims that the increased accuracy of their version
allows the elementary functions to be implemented in such a way that they run
faster than they would otherwise run [1, pp. 38 and 42].


METHODOLOGY

In order to investigate this matter, I implemented Peter Tang's algorithm for
the exponential function [2] in standard FORTRAN.  This algorithm requires
arithmetic that conforms to the IEEE standard for binary floating-point
arithmetic, and has an error bound of no more than 0.54 ulps (units in the last
place) when the result does not underflow.  (A program written by Alex Liu
reported that the actual error of my implementation stays within this
theoretical bound.)  The source code for my implementation appears at the end
of this write-up.  My hope was then to measure the number of clock cycles
required on both the IBM and HP workstations, and to analyze the benefits each
version of the instruction might offer this particular algorithm.
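
For readers unfamiliar with the measure, the following rough sketch (my own,
not Dr. Liu's actual program) shows one way to express a computed value's
error in ulps, by comparing against a higher-precision reference:

    #include <math.h>

    /* An "ulp" of r is taken here to be the gap between |r| and the
       next larger double.  (Boundary cases, such as exact powers of
       two, need more care than this sketch gives them.)               */
    double error_in_ulps(double computed, long double reference) {
        double r   = (double)reference;  /* nearest double             */
        double ulp = nextafter(fabs(r), HUGE_VAL) - fabs(r);
        return (double)(fabsl((long double)computed - reference) / ulp);
    }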

Unfortunately, my investigation was hampered by two insurmountable obstacles:
the lack of a FORTRAN compiler that generates the multiply-add instruction on
the HP workstation, and the lack of time to learn the assembly languages of
both the IBM and HP workstations.  The latter matters because, without hand
tuning, my implementation runs too slowly to be compared fairly with the
functions supplied in the vendors' libraries.  For example, the run time of
my original FORTRAN implementation is approximately 1.7 times that of the
exponential function in Sun Technology Enterprises' libm.  When I replaced a
FORTRAN intrinsic function with equivalent arithmetic, the run time decreased
to 1.3 times that of libm.  I then took the assembly language generated by the
FORTRAN compiler, removed a few useless instructions (which reduced the run
time by just a few clock cycles), and reordered some of the remaining
instructions (this mainly consisted of moving instructions from later basic
blocks into earlier ones in order to fill delay slots).  On a Sun SLC my
hand-optimized version is barely faster than the exponential function in libm
in some intervals (by something like 2 or 3 clock cycles, as far as I can
tell).  (My version is NOT faster over all intervals, nor is it guaranteed to
be faster in certain intervals on all Sun workstations.)

On the IBM workstation I did not fare so well.  When argument reduction is not
needed, my implementation takes twice as long as IBM's, and when argument
reduction is needed, it takes three times as long.  Performance on the IBM
workstation is more sensitive to the quality of optimization than it is on
current Sun workstations, because the former uses a superscalar processor, so
I imagine hand optimization of the assembly language code could have a
significant impact on the run time of my exponential routine.  It is possible
that IBM's exponential routine uses a different algorithm that is especially
attractive for their architecture, in which case Tang's algorithm might not
have the potential to perform as well as the algorithm IBM's implementation
uses.  (Indeed, according to [1], IBM uses a different algorithm based on a
table whose data points are not quite evenly spaced.  This algorithm is
capable of producing function values that are correctly rounded at least
almost all the time.  [As far as Dr. Liu's program was able to tell, IBM's
exponential routine delivers correctly rounded results.]  [1] mentions a
specific place where the accuracy of IBM's instruction is beneficial in this
algorithm.)  Until someone takes the time to hand-optimize my routine, I
cannot know for sure whether it makes sense to compare its performance with
IBM's.


ANALYSIS

The steps in Tang's algorithm are as follows:

   1. Filter out exceptional cases.

   2. Reduce the argument to the interval [-(ln 2)/64, (ln 2)/64] if necessary.

   3. Compute the value of a 6th-degree polynomial p(y), where y is the reduced
      argument belonging to the interval above.

   4. Compute the value of the function using the information obtained from the
      steps above.

Obviously, the multiply-add instruction is useful only in situations where
both multiplication and addition must be performed.  In Tang's algorithm,
these situations occur in the last three steps.  Let's consider each of these
steps in turn (a code sketch following this list makes the operation counts
concrete):

   1. When reducing the argument, one computes the following expressions:
                           yhigh = x - n*(ln2high/32.0)
                           y = yhigh - n*(ln2low/32.0)
      where x is the argument to the function and n is a previously computed
      integer.  Here HP's multiply-add instruction can be used at most once:
      after the first product has been computed, the second product can be
      computed simultaneously with the first subtraction.  (The argument x is
      not needed in the remainder of the code.)  IBM's multiply-add
      instruction can be used twice, once for each expression.  One might
      think that the increased accuracy of IBM's instruction would be
      beneficial here (as is reported in [1, p. 38]).  Unfortunately, the
      value of (ln 2)/32 must be accurate to more than double precision,
      which is why it is split into a high part and a low part to begin with.

   2. Horner's scheme is used to evaluate the polynomial.  If one evaluates
      the even and odd terms separately (writing p(y) = q(y^2) + y*r(y^2)),
      HP's multiply-add instruction can be used at most three times, whereas
      IBM's multiply-add instruction can be used five times.  Again, the
      increased accuracy of IBM's instruction hardly helps, since presumably
      all 6 terms are needed.  Furthermore, the error bound for evaluating
      this polynomial is barely over 0.5 ulps, and no error bound can go
      below 0.5 ulps.  And even if some operation could be eliminated (I did
      not do the error analysis carefully enough to be sure, but it is
      certainly not possible to eliminate more than an addition), this step
      would not execute any faster.

   3. HP's multiply-add instruction cannot be used when computing the function
      value, while IBM's instruction can be used once.  The increased accuracy 
      of IBM's instruction does not help because none of the operations can be
      eliminated, and anyway this step already introduces a very small amount 
      of error.
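
To make the operation counts above concrete, here is a sketch of these steps
written in C, with fma() standing in for IBM's fused instruction.  The
constants LN2HI32 and LN2LO32, the coefficients A1 through A6, and the table
value s are placeholders passed in as parameters; Tang's paper [2] gives the
actual values, the table of 2^(j/32), and a more careful final reconstruction
than the simplified one shown here.  For simplicity the polynomial uses
straight Horner's rule, which also needs five fused operations; the even/odd
split mentioned in item 2 exposes more parallelism (at the cost of an extra
multiplication).

    #include <math.h>

    /* Sketch only: where the fused multiply-adds fall.                */
    double exp_core(double x, int n, double s,
                    double LN2HI32, double LN2LO32,
                    double A1, double A2, double A3,
                    double A4, double A5, double A6)
    {
        double dn = (double)n;

        /* Argument reduction: two fused operations.                   */
        double yhigh = fma(-dn, LN2HI32, x);      /* x - n*(ln2hi/32)  */
        double y     = fma(-dn, LN2LO32, yhigh);  /* yhigh - n*(ln2lo/32) */

        /* p(y) = A1*y + A2*y^2 + ... + A6*y^6 by Horner's rule:
           five fused operations plus one plain multiply by y.         */
        double p = y * fma(y, fma(y, fma(y, fma(y, fma(y, A6, A5),
                                                 A4), A3), A2), A1);

        /* Reconstruction: one more fused operation, s + s*p = s*(1+p),
           simplified relative to Tang's actual final step.            */
        return fma(s, p, s);
    }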


CONCLUSION AND WORK NOT DONE

After conducting this shallow investigation, I feel that IBM's version of
multiply-add is the better one, though not because of the increased accuracy
it affords; I was unable to find a single place in Tang's algorithm where that
accuracy would be beneficial.  Rather, the reason is that IBM's instruction
has the effect of reducing the latency of multiplication and addition, so that
these two operations can be alternated in very quick succession: when
successive multiply-add instructions are executed, they can be issued every
clock cycle if they do not depend on the immediately preceding instruction,
and every other clock cycle otherwise.  HP's instruction, on the other hand,
is of no use unless there is enough parallelism in the code, such as when
there is a loop that can be unrolled.  (There are no loops in Tang's
algorithm.)  Increased parallelism also gives IBM's FPU opportunities for
improved performance, but IBM's instruction can be used in a wider variety of
situations.

I was not able to do everything I would have liked.  For one, I did not have
time to hand-optimize my implementation on the IBM workstation, nor was I able
to run a version of my implementation using the multiply-add instruction on
the HP workstation.  It would be interesting to try to figure out how IBM's
implementation works.  (I disassembled their routine using adb.)  Finally, I
am waiting for someone at IBM to explain to me why the increased accuracy of
their instruction is beneficial.


ACKNOWLEDGEMENTS

I received help and moral support from the Floating-Point Group at Sun
Technology Enterprises: David Hough, K. C. Ng, Alex Liu, Keith Bierman,
Marianne Mueller, Gregory Tarsy, and Shing Ma.


REFERENCES

1. B. Olsson et al., "RISC System/6000 Floating-Point Unit," [I don't know
   for sure where this paper appeared, but it seems very likely that it
   appeared in:  IBM RISC System/6000 Technology, SA23-2619, IBM Corporation,
   1990,] pp. 34-42.

2. P. T. P. Tang, "Table-Driven Implementation of the Exponential Function
   in IEEE Floating-Point Arithmetic," ACM Transactions on Mathematical
   Software, vol. 15, no. 2 (June 1989), pp. 144-157.


SOURCE CODE

[The source code is available upon request from the author.]
