Complex multiply-add

David Hough <validgh!uunet!Eng.Sun.COM!dgh@Sun.COM>
Mon Jan 21 21:25:20 PST 1991


This should teach Joel to post his memoirs!


Abstract:
The IBM RISC System/6000 floating-point unit is based on an underlying hardware
implementation of what I call fmuladdd, double-precision multiply-add, on
which all other multiply and add instructions are built.  Is the next step
fzmuladdw, double-precision complex multiply followed by quadruple-precision
complex add, all with only one rounding error?  Comments solicited.  Sun
employees can get tbl | eqn | troff -ms source from ~dgh/memo/complex.mul.add,
but they might want to wait until I've had a chance to receive and digest
initial feedback.

     A discussion of three-operand add instructions inspired the following
reminiscence by Joel Boney:

     Well, it may be a lot of opcodes if you are trying to add them to an
     existing architecture. (The following is a little sketchy since it
     has been about 2 years since I worked on the RS/6000 and I don't
     have any detailed documentation at hand.) The RS/6000 however, only
     has one floating point operation and it's "multiply add". All the
     other operations are simply special types of multiply adds (Divide
     is the exception. Its implementation is akin to microcode using
     multiply-adds). So it may only take 1 or 2 bits in the opcode to
     determine a fp operation. Then there are opcode bits that allow any
     of the 3 input operands to be negated. This gives the orthogonal set
     of multiply add operations above.  Subtract is just an add with the
     second operand negated.

     Also, I believe some instruction bits indicate when to replace one
     of the input operands with a "canned value". Thus an add a + b
     becomes a multiply-add of: a * 1.0 + b.  And A * B becomes A * B +
     0.0. Negate x becomes -x * 1.0 + 0.0. Now I suppose you could call
     some of these bits the opcode bits for add, subtract, negate, etc.,
     but they are really more general and are easy to decode and imple-
     ment in hardware. So with this sort of scheme everything is a
     multiply-add and the instruction decode bits are used to differen-
     tiate negation of input operands and dummy input operands. (If I
     remember correctly there are some  minor problems that have to do
     with getting correctly signed zero results with this scheme. I think
     these were solved with extra logic of some sort.)

     Therefore, I don't believe the RS/6000 uses any more bits for
     instruction decode of floating point operations than do other
     existing architectures.
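
     Taken literally, and with C99's fma standing in for the single-rounding
multiply-add, the scheme Joel describes amounts to something like the
following sketch; the function name and flag encoding are illustrative only,
not the actual RS/6000 encoding:

        #include <math.h>

        /* Sketch of the scheme described above: every operation is a*b + c
           computed with one rounding, with "opcode bits" that negate
           operands and "canned values" 1.0 and 0.0 substituted for missing
           operands.                                                        */
        static double fused_op(double a, double b, double c,
                               int neg_a, int neg_b, int neg_c)
        {
            if (neg_a) a = -a;
            if (neg_b) b = -b;
            if (neg_c) c = -c;
            return fma(a, b, c);
        }

        /* add a+b:      fused_op(a, 1.0, b, 0, 0, 0)
           subtract a-b: fused_op(a, 1.0, b, 0, 0, 1)
           multiply a*b: fused_op(a, b, 0.0, 0, 0, 0)
           negate -x:    fused_op(x, 1.0, 0.0, 1, 0, 0)
           (as the quote notes, -x * 1.0 + 0.0 yields +0 rather than -0 when
           x is +0; that is the kind of signed-zero corner the extra logic
           has to fix)                                                       */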

     Joel's remarks led me to consider what the next logical extension might
be.  I decided that complex multiply-add warranted consideration, but first,
some background.

Current RISC floating-point instruction sets

     The common "vectorizable" codes that thrive on Crays and other supercom-
puters often involve multiplies chained to adds of the product into a register
serving as an accumulator.  These systems typically perform relatively
poorly on problems that can't be so characterized.
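
     The archetypal example of such a code is the dot-product loop, in which
every iteration is exactly one multiply chained to one add into an
accumulator; in C, something like:

        /* Each iteration is a multiply chained to an add into an
           accumulator - the pattern a fused multiply-add instruction
           serves with a single instruction per element.               */
        double dot(const double *x, const double *y, int n)
        {
            double acc = 0.0;
            for (int i = 0; i < n; i++)
                acc += x[i] * y[i];
            return acc;
        }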

     Most RISC floating-point instruction sets consist of IEEE 754 single and
double-precision, three-address multiply and add instructions, plus a few
more.  The common "vectorizable" codes are impeded in most current implementa-
tions because the multiply and the add require separate instruction fetch and
decode and may not benefit from chaining possibilities.  Superscalar implemen-
tations may alleviate some of this bottleneck, as does chaining hardware that
permits the results of one instruction to immediately enter a subsequent
instruction, while the store to the destination register proceeds in parallel.

     The IBM RS/6000 architecture is based on an instruction set that includes
four-address double-precision multiply-add instructions, computed with only
one rounding error.  That is, the multiply-add result is the same as that
obtained by correctly rounding the infinitely precise correct result of a*b+c
to double precision. In current implementations, all other adds and multiplies
are based on this underlying mechanism.  Even the iterative division algorithm
exploits the accurate multiply-add primitive.
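
     The effect of the single rounding can be demonstrated with C99's fma,
which rounds a*b+c exactly once; the example below assumes the compiler is
told not to contract the ordinary expression into a fused operation on its
own:

        #include <math.h>
        #include <stdio.h>
        #pragma STDC FP_CONTRACT OFF   /* keep a*b - 1.0 as two roundings */

        int main(void)
        {
            double a = 1.0 + 0x1p-30;          /* both exactly representable, */
            double b = 1.0 - 0x1p-30;          /* so a*b is exactly 1 - 2^-60 */
            double two = a * b - 1.0;      /* product rounds to 1.0; prints 0  */
            double one = fma(a, b, -1.0);  /* rounded once; prints -0x1p-60    */
            printf("%a  %a\n", two, one);
            return 0;
        }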

     The RS/6000 implementations achieve outstanding performance in many com-
mon "vectorizable" applications that can exploit the multiply-add primitive
effectively.  For other types of applications, the attained performance is
competent and competitive but not usually outstanding. Many other technologies
are involved in RS/6000 implementations besides multiply-add, of course,
including in particular, advanced compiler optimization techniques.

     There are some potential issues with this multiply-add design.  The
numerical results are typically slightly different from what would be obtained
by a correctly-rounded multiply followed by a correctly-rounded add.  The
single-error results have a lower error bound and thus are more accurate;
little numerical software is written to exploit such better error bounds,
however, as users of the Intel 80x87 and Motorola MC6888x have discovered.
Programs for current RS/6000 implementations can also be compiled in a
conventional rounding-error mode, which sacrifices the performance advantage
of the multiply-add.  While some users might expect that relatively
standardized
floating-point instruction sets, based on a bitwise IEEE 754 standard for sin-
gle and double-precision arithmetic, would produce identical results on dif-
ferent systems, differences in expression evaluation order and intermediate
precision, base conversion, and elementary transcendental function implementa-
tion have conspired to prevent any significant bitwise portability so far.
The availability of compilers like GCC, that implement similar languages in
similar ways for all important RISC architectures, plus the advent of
networked computing where computational jobs may be assigned to any of dis-
similar hosts, suggests that in the future, however, bitwise portability among
nominally IEEE 754-conforming systems may become a reasonable and more common
requirement.

     Thus IBM's competitors are faced with achieving competitive parity of
performance in the face of uncertain future requirements with respect to accu-
racy and bitwise compatibility.  Furthermore they are mostly constrained to
compatibility with existing three-operand instruction formats.

Doubled-precision product

     In the days of fixed-point Frieden calculators, dot products were rou-
tinely accumulated in twice the precision of the data.  The same idea has per-
sisted through the years, although slowly yielding sway as the original
designers of computer arithmetic hardware and compilers, who were numerically
oriented, have been replaced by computer scientists.  For instance, Fortran-77
grudgingly incorporated the DPROD intrinsic, which yields a product of twice
the precision of its operands; but the failure of instruction set architects
to provide it, and of compiler writers to implement it as efficiently as
possible, means that it's not used in mathematical software, even though it
need cost no more than working-precision multiplication and affords
significantly better error bounds on computations like dot products, without
performing the entire computation in doubled precision.
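
     What DPROD buys is easy to state in C: a product of two single-precision
operands formed in double precision is exact (24 + 24 significand bits fit in
53), so a dot product of single-precision data, for example, incurs rounding
error only in the additions.  A sketch (the function name is illustrative):

        /* DPROD in spirit: each float*float product is formed exactly in
           double precision, so only the accumulation rounds.             */
        double sdot_dprod(const float *x, const float *y, int n)
        {
            double acc = 0.0;
            for (int i = 0; i < n; i++)
                acc += (double)x[i] * (double)y[i];
            return acc;
        }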

Doubled precision

     An alternative to quadruple precision in hardware, or a way to double
whatever working precision may be fully supported, is to use the techniques of
"doubled precision" dating back to 1951, and described in a memorandum of W.
Kahan dated 26 February 1987.  The idea is to represent a doubled-precision
variable, say X, as a sum of two working-precision variables, say x1 and x2.
Then doubled-precision floating-point arithmetic can be built up from
working-precision arithmetic, the efficiency and cleanliness of the result
reflecting the quality of the working-precision arithmetic and the support
tools available, such as doubled-precision product or three-operand addition.

     For instance, the following code for computing the doubled-precision  sum
Z=z1+z2  from the operands X=x1+x2 and Y=y1+y2 suffices for many (but not all)
needs:

        if |x1| < |y1| then swap X and Y
        t1 = (x2 + y1) + x1
        t2 = (((x1 - t1) + y1) + x2) + y2
        z1 = t1 + t2
        z2 = (t1 - z1) + t2

Neither magnitude-compare nor swap instructions are available in most RISC
architectures, and any kind of conditional branch is a problem for high
performance, but the main consideration here is that at least 11 ALU
instructions seem to be required.
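
     For concreteness, the pseudocode transliterates into C roughly as follows
(a direct transcription, assuming IEEE 754 doubles with round to nearest; the
function name is illustrative):

        #include <math.h>

        /* Doubled-precision sum Z = X + Y, where X = x1 + x2 and
           Y = y1 + y2; each half is an ordinary double.             */
        void dd_add(double x1, double x2, double y1, double y2,
                    double *z1, double *z2)
        {
            double t, t1, t2;
            if (fabs(x1) < fabs(y1)) {            /* swap X and Y */
                t = x1; x1 = y1; y1 = t;
                t = x2; x2 = y2; y2 = t;
            }
            t1 = (x2 + y1) + x1;
            t2 = (((x1 - t1) + y1) + x2) + y2;
            *z1 = t1 + t2;
            *z2 = (t1 - *z1) + t2;
        }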

     If there were instructions for computing three-operand sums x + y + z
with only one rounding error, the foregoing could be simplified to something
like

        z1 = x2 + y2
        z1 = z1 + x1 + y1
        z2 = - z1
        z2 = z2 + x1 + y1
        z2 = z2 + x2 + y2

with only 5 ALU instructions.  I have intentionally limited myself to what is
natural on most current RISC architectures: three-operand adds that replace
one of the operands with the result.

     The value of doubled precision lies seldom in a general doubled-precision
data type, but rather in limited contexts where extra precision is needed to
preserve the working-precision accuracy of the final result.  Kahan cites
applications to ODEs and elementary transcendental function computations.

Quadruple precision

     Some of the RISC architectures, including at least SPARC V8 and (HP) PA-
RISC, specify 128-bit quadruple precision floating-point data formats and
instructions, although no hardware implementations of these instructions have
been announced.

     Some reasons for specifying quadruple precision:

*    Programs written to exploit Cray 128-bit floating-point precision are
     sometimes cited as reasons to stick with Cray mainframes instead of mov-
     ing to IEEE 754 workstations.  Less often cited are VAX H format and the
     128-bit formats of the 360/91 and its progeny.  That most of these cited
     programs would benefit from rewriting to use 64-bit precision primarily,
     with only occasional excursions into 128-bit precision at critical
     points, may be true but doesn't seem to carry much weight.  Neither does
     the rather poor relative performance of Cray 128-bit precision, for it is
     a doubled-precision implementation based on an untidy underlying 64-bit
     precision.

*    Relative to 96-bit extended precision, 128-bit quadruple precision con-
     sumes no more memory bandwidth on systems with 64-bit data busses.

*    Especially in Europe, there is some interest in incorporating the
     Kulisch-Miranker paradigm of correctly-rounded dot products into computer
     hardware and programming languages, in order to get satisfactory interval
     arithmetic bounds for computed results.  But in most cases, satisfactory
     bounds can be obtained by exploiting quadruple precision interval arith-
     metic, at far lower cost in programming effort and in run-time effi-
     ciency.

Complex arithmetic on RISCs

     None of the RISC architectures define complex data types, on the
reasonable-sounding basis that complex instructions can be fabricated from
simpler real instructions with just as high performance.

     In the CISC arena, I specified single-precision complex data types and
instructions for the Sun-3X FPA+, which despite fine electrical design by Mark
McPherson and fine microcoding by Fernando Urbina, was doomed to oblivion as
its intended CPU host, the Sun-3/4x0 (also a fine piece of work with some
innovative I/O performance features that persist in the 4/4x0), was mostly
overlooked in the unexpectedly sudden transition of Sun's customer base from
Sun-3's to SPARCstations.  By modifying the compiler to accept inline instruc-
tion templates for the FPA+ complex instructions, I was able to get most of
the performance I had expected, relative to what would have prevailed had
other things remained equal.  But after the project was underway, competitive
pressure for improved complex arithmetic performance in the Sun-4 line led the
compiler optimizer group to generate most complex arithmetic operations
inline instead of as function calls, for both the Sun-4 and Sun-3; combined
with the other optimizations this permitted, that change eliminated most of
the advantage of the complex data type on the Sun-3.

     So the question for RISC instruction set architects is whether complex
arithmetic instructions pay off.  The criterion has to be cost, primarily in
terms of extra data paths and gates, versus benefit, namely performance
improvement on complex arithmetic codes.  Certainly for low-end systems there
is no advantage.  The kind of parallelism that complex instructions can
exploit (like vector instructions) doesn't pay off if there are no parallel
units to exploit.  Complex instruction sets would simply be microcoded and
would offer no performance advantage.  So the question has to be decided in
terms of high-end performance.  Does the opportunity for parallelism offer
performance enhancements that aren't available by exploiting simpler technolo-
gies like superscalar execution, chaining, loop unrolling, etc.?

     At least two of the kernels in the "nasa7" benchmark incorporated in SPEC
depend on complex arithmetic; one represents linear algebra and another
signal processing.

Complex multiply-add

     Combining the foregoing ideas, we can define a complex multiply-add
instruction in SPARC-like notation

        fzmuladdw       rs1,rs2,rd

as:

     take the 128-bit double-precision complex numbers in registers rs1
     and rs2, form their 256-bit quadruple-precision complex product, add
     it to the 256-bit quadruple-precision complex accumulator in regis-
     ter rd, round the real and imaginary components once to 128-bit qua-
     druple precision, and store the 256-bit result in rd.
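
     In terms of real and imaginary components, and ignoring the
quadruple-precision accumulation and single rounding that portable C cannot
express, the operation computes something like:

        /* Component-wise meaning of fzmuladdw: rd = rd + rs1*rs2 on complex
           operands.  Plain doubles stand in for the halves here; the
           proposed instruction keeps rd in quadruple precision and rounds
           each component only once.                                        */
        typedef struct { double re, im; } dcomplex;

        static dcomplex zmuladd(dcomplex rd, dcomplex rs1, dcomplex rs2)
        {
            dcomplex r;
            r.re = rd.re + rs1.re * rs2.re - rs1.im * rs2.im;
            r.im = rd.im + rs1.re * rs2.im + rs1.im * rs2.re;
            return r;
        }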

     Most other RISC instruction sets suggest similar extensions, subject to
availability of opcode space.  Note that evolution and standardization of the
SPARC architecture itself are now under the control of the SPARC International
consortium and an IEEE working group, P1754.

     To conserve hardware, we would like to combine the implementation with
that of other conceptually simpler instructions like

        fmulq           rs1,rs2,rd

defined in SPARC V8, which computes in rd the 128-bit quadruple-precision
floating-point product of the 128-bit quadruple-precision floating-point
operands rs1 and rs2, and an instruction that would help out (for instance in
correctly-rounded base conversion),

        fmulu           rs1,rs2,rd

which computes in rd the 128-bit unsigned integer product of the 64-bit
unsigned integer operands in rs1 and rs2.
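
     The effect of fmulu is easy to express in portable C by splitting each
operand into 32-bit halves and summing four partial products - the same
structure as the 64x64 multiplier arrays discussed below:

        #include <stdint.h>

        /* 64 x 64 -> 128-bit unsigned product from four 32 x 32 -> 64-bit
           partial products; fmulu would deliver this in one instruction.  */
        static void mulu128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
        {
            uint64_t al = (uint32_t)a, ah = a >> 32;
            uint64_t bl = (uint32_t)b, bh = b >> 32;
            uint64_t p0 = al * bl, p1 = al * bh, p2 = ah * bl, p3 = ah * bh;
            uint64_t mid  = p1 + (p0 >> 32);         /* cannot overflow */
            uint64_t mid2 = p2 + (uint32_t)mid;      /* cannot overflow */
            *lo = (mid2 << 32) | (uint32_t)p0;
            *hi = p3 + (mid >> 32) + (mid2 >> 32);
        }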

Hardware unit

     One of the challenges of CMOS design seems to be how to use all the
available cheap gates effectively.  The foregoing considerations suggest, in
maximal form, a device containing six input ports from the register file, four
64x64 multiplier arrays that each produce 128-bit products, and two
quadruple-precision three-operand adders.  Cheaper implementations have fewer
functional units.

     The multipliers are connected to the inputs in several configurations
(but short of a complete crossbar), while there are also several ways to
connect the outputs of the multipliers to the three-operand adders, all
selected by the various implemented opcodes.  Let R1 and M1 be the 64-bit real
and imaginary parts of the rs1 operand, with R2 and M2 the corresponding parts
of rs2, while RD and MD are the 128-bit real and imaginary parts of the
destination.  The picture is something like the following, which I can get to
work in nroff but not troff:

width in bits:          128     64      64      64      64      128

input stage:            RD      R1      M1      R2      M2      MD

                        [each multiplier selects two of R1, M1, R2, M2
                         as its operands; short of a complete crossbar]

multiply stage:         (dp)    d*      d*      d*      d*       (dp)

                        [the left adder combines RD with two of the
                         products, the right adder combines MD with
                         the other two]

add stage:               ++-                                     +++

round stage:            (wp)                                     (wp)

output stage:            RD                                      MD


     The six operands are gathered in the input stage and routed to the pieces
that need them.  "d*" represents a multiplier that produces an exact product
of double the precision of its operands.  "++-" and "+++" are three-operand
adders that can attach various configurations of signs to their input operands
according to the intended instruction; the indicated signs are for complex
multiply-add.  These three-operand adders are "exact" in the sense that they
preserve enough information to allow the result to be correctly rounded to
doubled precision or to working precision.  "(dp)" represents a stage that
doubles an operand from working precision to twice that; it is not used for
this multiply-add because RD and MD are already doubled.  "(wp)" represents a
stage that rounds to either doubled precision or working precision according
to the instruction - doubled precision in this case.  The round stage should
feed back to the multiply stage for chaining, but the picture above was
complicated enough already!

     fmulq is implemented by routing the various partial products to the adder
somewhat differently, and allowing for the fact that the lower order parts of
the multiplicands aren't in conventional floating-point format.  fmulu is
similar but integer multiplication and addition are performed at all stages
and rounding is bypassed.
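
     The flavor of this partial-product routing can be shown in software at
ordinary double precision: the Veltkamp split and Dekker product below recover
an exact product as a rounded product plus its error, from exactly the four
partial products of the split halves (assuming round to nearest and no
overflow):

        /* Veltkamp split and Dekker product: p + e equals a*b exactly,
           built from the four partial products of the split halves.    */
        static void split(double a, double *hi, double *lo)
        {
            double c = 134217729.0 * a;              /* 2^27 + 1 */
            *hi = c - (c - a);
            *lo = a - *hi;
        }

        static void two_product(double a, double b, double *p, double *e)
        {
            double ah, al, bh, bl;
            *p = a * b;
            split(a, &ah, &al);
            split(b, &bh, &bl);
            *e = ((ah * bh - *p) + ah * bl + al * bh) + al * bl;
        }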

     This looks a lot more like a microcoded vector processor than most RISC
floating-point units so far.  Ordinary multiplications and additions (and
divisions, sqrts, etc.) would either be processed by separate units optimized
for those simpler cases, or treated as special cases in this unit, with some
potential loss of performance depending on the details.

List of instructions

     Data type definitions: Let s be single, d double, q quad; c single
complex, z double complex, w quad complex.

     Operation definitions:

        muladd              rd = rd + rs1 * rs2
        mulsub              rd = rd - rs1 * rs2
        mulrsub             rd = rs1 * rs2 - rd
        madd                rd = |rs1| + |rs2|
        addadd              rd = rd + rs1 + rs2
        addsub              rd = rd + rs1 - rs2
        subsub              rd = rd - rs1 - rs2
        naddadd             rd = -rd + rs1 + rs2
        naddsub             rd = -rd + rs1 - rs2
        nsubsub             rd = -rd - rs1 - rs2

The following floating-point instructions could be implemented in the
foregoing:

                               List of supported instructions

Ordinary multiplication                                   fmul[sdq]
Exact doubled multiplication                              fsmuld, fdmulq
Ordinary complex multiplication with one rounding error   fmul[cz]
Doubled complex multiplication with one rounding error    fcmulz, fzmulw
Complex times real                                        fsmulc fdmulz

Ordinary addition                                         f{add,sub}[sdq]
Magnitude addition                                        fmadd[sdq]
Ordinary complex addition                                 f{add,sub}[cz]
Complex plus real                                         fs{add,sub,rsub}c fd{add,sub,rsub}z
3-operand addition with one rounding error
                                                          f{addadd,addsub,subsub}[sdq]
                                                          f{naddadd,naddsub,nsubsub}[sdq]

Ordinary real multiply-add with one rounding error        f{muladd,mulsub,mulrsub}[sd]
Doubled real multiply-add with one rounding error         fs{muladd,mulsub,mulrsub}d
                                                          fd{muladd,mulsub,mulrsub}q
Ordinary complex multiply-add with one rounding error     f{muladd,mulsub,mulrsub}[cz]
Doubled complex multiply-add with one rounding error      fc{muladd,mulsub,mulrsub}z
                                                          fz{muladd,mulsub,mulrsub}w

There are some unresolved conflicts between orthogonality, a requirement for
well-optimized compiled code, and minimal sufficiency in a (sort of) RISC
design.  Instructions that can't be readily implemented in the complex
multiply-add framework presented above are intentionally excluded from the
list, despite orthogonality arguments.


