the hardware giveth and the compiler taketh away ...

Sat Feb 16 09:25:21 PST 1991

 ... and eventually the hardware taketh away too, unless the potential
users make a big enough deal about it.   Almost nobody (in a mass
market) would pay extra for integer multiply or floating-point sqrt;
after all they just want to run their spreadsheets or word processors
or graphics applications in their window system.  [Or maybe they don't
want to pay for any number crunching at all, they just want to pass
their command and control messages securely and efficiently on a
public network.]

MIPS forgot (or couldn't fit) sqrt, Sun forgot (or couldn't fit)
integer multiply.  No kind of naive statistical analysis of "average"
computing loads on existing computers would have suggested that either
is worth a lot of gates; you have to find the justification some
other way.

The appended message from comp.arch basically makes the Herman Rubin
point somewhat more succinctly than Herman usually does: that if the only
software you ever write is kernels, device drivers, text editors,
compilers, and linkers, then you design languages in which it is only
convenient to write kernels ... linkers; programs written in such
languages by necessity look like kernels ...  linkers, even if they're
not; if you design hardware according to statistics from such programs,
it ends up optimized for that same set of jobs and really hard to apply
to any other.  This cuts both ways of course; all computers designed by
Seymour Cray are highly tuned for large physical simulations with
inexact initial conditions but relatively awkward for other tasks.  But
Seymour is no longer in the mainstream of computer architecture
represented by SPARC, MIPS, RS6000, PA-RISC, and 88000.

In the case at hand, SPARC V8 describes integer multiplication
instructions, but it's only a paper spec, albeit a candidate to become
IEEE Standard 1754.  The signed and unsigned integer multiply
instructions produce 64-bit results from 32-bit operands, as they
should.
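
As a rough illustration of what "64-bit results from 32-bit operands"
means, here is a minimal C sketch.  It assumes a compiler that provides
the 64-bit types of <stdint.h>, which is an assumption of this sketch
and not something the C discussed here guarantees.  The names umul32
and smul32 are made up for the example; they are not SPARC mnemonics
or library routines.  The high word is the half that would land in %y.

```c
#include <stdint.h>

/* Unsigned 32x32->64 product, returned as two 32-bit halves.
   hi corresponds to the word SPARC V8 would leave in %y. */
void umul32(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint64_t p = (uint64_t)a * b;      /* widen BEFORE multiplying */
    *hi = (uint32_t)(p >> 32);
    *lo = (uint32_t)p;
}

/* Signed 32x32->64 product.  The halves are returned as raw bit
   patterns so that no implementation-defined shift or narrowing
   of a negative value is needed. */
void smul32(int32_t a, int32_t b, uint32_t *hi, uint32_t *lo)
{
    uint64_t p = (uint64_t)((int64_t)a * b);
    *hi = (uint32_t)(p >> 32);
    *lo = (uint32_t)p;
}
```

The low words of the signed and unsigned products always agree; only
the high words differ, which is why two instructions are needed.  For
example, umul32(0xFFFFFFFF, 0xFFFFFFFF) yields hi = 0xFFFFFFFE, lo = 1,
while smul32(-1, -1) yields hi = 0, lo = 1.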

Complete realizations of the V8 instruction set are in the future.  But
even with the hardware, we still have a problem in C or Fortran, namely
how to efficiently exploit doubled-precision products.  Even if you
had defined a 64-bit integer type, you would likely have had trouble
convincing X3J11 to put a corresponding doubled-precision product
operator into C, on the grounds that it wasn't existing practice, or
that it wasn't efficient or meaningful on some current hardware, or
that if you wanted it you could get it as a permissible "as-if"
optimization.  The bottom line is that you can't use it in ostensibly
portable and efficient C code.  So of course, the statistics of the
next generation of C compilers will reinforce the notion that you
don't need such a thing.  To use such hardware at all, then, you have
to define a machine-dependent function, for which the function call
overhead may dwarf the time spent doing the multiply.  Remember that
function call overhead includes, for instance, all the extra work a
function must do when it would otherwise enjoy leaf-node optimization
but now has to allow for an external call.
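
To make the cost concrete, the following is a sketch of what strictly
portable C is reduced to when neither a double-width product operator
nor a 64-bit type is available: the product is assembled from four
16x16->32 partial products.  The function name umul32_portable is
hypothetical, and the fixed-width typedefs from <stdint.h> are used
only for clarity; they are an assumption of the sketch.

```c
#include <stdint.h>

/* 32x32->64 unsigned multiply using only 32-bit arithmetic.
   The operands are split into 16-bit halves, so each partial
   product fits in 32 bits. */
void umul32_portable(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t a0 = a & 0xFFFFu, a1 = a >> 16;
    uint32_t b0 = b & 0xFFFFu, b1 = b >> 16;

    uint32_t p00 = a0 * b0;    /* weight 2^0  */
    uint32_t p01 = a0 * b1;    /* weight 2^16 */
    uint32_t p10 = a1 * b0;    /* weight 2^16 */
    uint32_t p11 = a1 * b1;    /* weight 2^32 */

    /* Sum the column of weight 2^16.  Each term is below 2^16,
       so the sum cannot overflow 32 bits. */
    uint32_t mid = (p00 >> 16) + (p01 & 0xFFFFu) + (p10 & 0xFFFFu);

    *lo = (p00 & 0xFFFFu) | (mid << 16);
    *hi = p11 + (p01 >> 16) + (p10 >> 16) + (mid >> 16);
}
```

That is four multiplies plus a handful of shifts, masks, and adds to do
what a single hardware instruction could do, and this is before any
call overhead is counted.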

Sun compilers have had inline instruction templates that allow users to
replace functions with sequences of assembly language, but they are no
panacea in this case.  Functions that depend on funny registers like
 %y (where the upper 32 bits of the product will go in SPARC V8 integer
multiply) and %fsr that aren't used in normal compiler-generated code
tend to fare poorly because of peculiar hardware restrictions and bugs
in local optimizers that don't really understand these registers.  For
instance, on SPARC, writing the %fsr or %y has an undefined effect on
the next three instructions or so (to allow for pipelining), and there
is no hardware interlocking to protect you (too big an impact on
overall critical path), so it's only really safe to use these registers
in assembly-language procedures in which all local optimizations have
been disabled.  If you use the otherwise wonderful inline expansion
template mechanism, your three following nops may get rearranged or
rescheduled in surprising ways, defeating your intent.

(I'm picking on Sun compilers today because I'm working on a major
flame directed at some of our competitors.  Stay tuned.)

I think the correct executive summary is that architecture for
commercially-successful computers is harder than it looks.  There's
quite a balance of understanding required among what customers want to
do, what they need to do and how they would do it if they could, and
how to provide the necessary pieces in hardware, languages, and
libraries to allow all this to happen efficiently, and also in a way
that we can get to from where we are now.

-----------------------------------------------------------------------

From: shand@prl.dec.com (Mark Shand)
Newsgroups: comp.arch
Subject: Re: integer multiplies on a Sparc
Date: 13 Feb 91 13:40:57 GMT
Organization: Digital Equipment Corporation - Paris Research Laboratory

Integer multiply on SPARC is indeed poor.  I recently added
an assembler kernel for SPARC to our bignum package and found
the fastest way to do multiprecision integer multiply was
through the FPU.  The primitive I use is 32-bit x 16-bit -> 48-bit,
which can be computed exactly in double precision.  I've only timed it
on a SPARCstation 1, which has a rather slow 9-cycle DP mult.
The overall performance for multiprecision integer multiplies
is about 4 times slower than on a MIPS R2000,
which has a 12-16 (depending how you count) cycle 32x32->64
integer mult, but it is still faster than any other way of doing
full-word integer mult on an early SPARC.
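
The exactness argument can be sketched in C.  This is not the
assembler kernel from the bignum package, only an illustration of the
idea, assuming IEEE 754 double precision with its 53-bit significand:
a 48-bit product fits with room to spare, so the floating multiply
loses nothing.  The function names are invented for the example, and
uint64_t is used just to hold the results.

```c
#include <stdint.h>

/* 32-bit x 16-bit -> 48-bit product computed in the FPU.
   Both operands convert to double exactly, and the product is
   below 2^48, so it is representable without rounding. */
uint64_t mul32x16_fp(uint32_t a, uint16_t b)
{
    double p = (double)a * (double)b;
    return (uint64_t)p;
}

/* A full 32x32->64 product built from two of the primitives,
   one for each 16-bit half of b. */
uint64_t mul32x32_fp(uint32_t a, uint32_t b)
{
    uint64_t lo = mul32x16_fp(a, (uint16_t)(b & 0xFFFFu));
    uint64_t hi = mul32x16_fp(a, (uint16_t)(b >> 16));
    return lo + (hi << 16);    /* total is a*b, which fits in 64 bits */
}
```

A 32x32 product would need up to 64 bits and so could not be done in
one double-precision multiply, which is presumably why the primitive
stops at 16 bits for one operand.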

(our bignum package is available by mail from librarian@prl.dec.com,
we will be announcing an FTP server soon)

Even on a more balanced machine like the MIPS R2000/R3000, floating
mult, although more resource-intensive than integer mult, is a
higher-priority operation and, through the devotion of more hardware,
takes fewer cycles.

Moral: tradeoffs between integer and float are subtle; just because
an operation CAN be implemented more efficiently doesn't mean it
HAS BEEN.

Of course next year's CPU designers will benchmark your neural net code
that you've finally decided to cast in floats even though ints would
have served you equally well, and those designers will deprecate
integer multiply even further.

Questions:

Does anyone know which SPARC implementations include integer multiply
support beyond the multiply step instruction?  What is the opcode?
What happens if an early SPARC hits such an opcode?  Have these SPARC
implementations found their way into any product machines yet?

Another thing that bugged me about multiply step was that it doesn't
seem to give any way to get the high order part of the result.
MIPS on the contrary gives you lo and hi result registers.  This
is essential in multiprecision work.  Am I missing something in
multiply step?  Do the newer instructions help here?
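
To show why the high-order part matters in multiprecision work, here
is a minimal C sketch of the inner loop of a bignum-by-word multiply.
The function name bignum_mul_word and the least-significant-word-first
layout are assumptions of the example, not taken from any particular
package; the 64-bit intermediate stands in for a hi/lo register pair.

```c
#include <stddef.h>
#include <stdint.h>

/* r[0..n-1] = low n words of a[0..n-1] * w, least-significant
   word first.  Returns the final carry, i.e. word n of the result. */
uint32_t bignum_mul_word(uint32_t *r, const uint32_t *a, size_t n,
                         uint32_t w)
{
    uint32_t carry = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        /* (2^32-1)^2 + (2^32-1) < 2^64, so this cannot overflow. */
        uint64_t p = (uint64_t)a[i] * w + carry;
        r[i]  = (uint32_t)p;            /* "lo" goes to the result  */
        carry = (uint32_t)(p >> 32);    /* "hi" feeds the next word */
    }
    return carry;
}
```

Every iteration consumes the high word of the previous product, so a
multiply that delivers only the low 32 bits is of little use here.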

Mark Shand.