notes from Hot Chips II Symposium

David G. Hough on validgh dgh
Sat Aug 25 16:36:03 PDT 1990


Hot Chips II was held at Santa Clara University August 20-21.  This is by
no means a comprehensive report, just a list of points that I jotted down
during some of the talks I was awake for, which was, alas, by no means all
of them.  Unfortunately I had to miss the IBM RIOS talks completely.

One of the early speakers (talking about the Intel 960) provided a
standard set of questions and answers, which each subsequent speaker
then referred to:

when - I don't know
how much - I don't care
how does it compare to i860 - what's that?

All the reports are about chips, except for one on the SPEC and Perfect
Club benchmark suites, which I'll comment on first.

**************************************************************************

SPEC and Perfect Club Benchmarks - R. Saavedra-Barrera's evaluation was
an attempt to characterize the SPEC and Perfect Club benchmarks so that
users could use benchmark results to project performance on their own
programs.  He points out that the different components of SPEC and Perfect
provide drastically different performance measurements, so it's hard to know
what to expect on a program not in the suite.

This is a feature, not a bug, however.  All high-performance implementations
have the property that they look relatively better on some programs and
relatively worse on others, compared to other high-performance implementations
that come to market at the same time.  The mountains and valleys in SPEC
results accurately reflect that and prove that whatever the faults of specific
benchmarks in SPEC - and most of them have at least one - they are uncorrelated
and so demonstrate the range of relative performance to be expected.  In my
view the goal of SPEC test program accumulation should be to accumulate a
minimal set of programs that are orthogonal in the sense that no subset 
has closely correlated performance on all current hardware implementations.
This goal MAXIMIZES the mountains and valleys effect.
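To make the "orthogonal" criterion above concrete: it amounts to asking
that no pair of benchmarks in the suite produce nearly proportional scores
across current machines.  A minimal sketch, with made-up numbers standing
in for SPEC-style ratios on four hypothetical machines:

```python
import math

def correlation(x, y):
    """Pearson correlation of two benchmarks' scores across machines."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Made-up ratios on four machines: one benchmark that tracks raw
# clock rate, one that tracks FPU strength.  A correlation well
# below 1.0 means the pair carries independent information and
# both earn their place in the suite.
int_bench = [10.0, 20.0, 30.0, 40.0]
fp_bench = [40.0, 10.0, 35.0, 15.0]
print(correlation(int_bench, fp_bench))
```

A pair with correlation near 1.0 would be redundant - either member
predicts the other, so one could be dropped without shrinking the range
of mountains and valleys.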

The speaker was correct to the extent that he suggested that everybody
who uses SPEC results in marketing ought to accompany the results with a
report that explains the peaks and valleys in relation to the features of
a particular system.  MIPS does a pretty good job of this; Sun does
internally, but I don't know whether that information is public.

The speaker inadvertently chose a good example for demonstrating the problems
with this: he tried to explain the relatively poor performance of MIPS
systems on the matrix300 SPEC benchmark.  The explanation is quite simple:
matrix300 is basically a SAXPY but with two-dimensional array subscripts
in the source code.  MIPS compilers haven't done induction variable
optimization well for multi-dimensional subscripts.  Their focus has presumably
been elsewhere.

This simple explanation is fairly obvious to anyone who works on compiler
optimization and looks at the .s files.   Unfortunately the speaker attempted to figure out
what was going on by the simple method of counts of instructions executed.
He discovered that MIPS was generating more instructions.  This is true but
not what I call "the explanation".

(Multi-dimensional subscript induction variable optimization
is not a big deal.  Sun's compilers didn't handle multiple dimensions
on their early releases; we do now, and are getting better, but this is not
the biggest technical challenge we have ever faced, and we don't expect that
MIPS compilers will overlook that optimization for long.)
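The effect of that optimization can be sketched at the source level.
Everything below - the array sizes, the column indices, the flat-storage
model - is made up for illustration; the point is just that the
two-dimensional subscript arithmetic in a SAXPY-like inner loop reduces
to stepping a pointer by the row stride:

```python
# matrix300's hot loop is essentially SAXPY (y = y + a*x), but written
# with two-dimensional subscripts.  This toy column update mimics that.
N = 4
a = 2.0
A = [[float(i * N + j) for j in range(N)] for i in range(N)]
B = [[1.0] * N for _ in range(N)]
jx, jy = 1, 2  # columns involved in the update

# Naive form: every iteration re-derives the element address from (i, j).
for i in range(N):
    B[i][jy] += a * A[i][jx]

# What induction-variable optimization effectively does: replace the
# per-iteration subscript arithmetic with a pointer that steps by the
# row stride.  Modeled here with flat (row-major) storage.
flatA = [x for row in A for x in row]
flatB = [1.0] * (N * N)
px, py = jx, jy            # "pointers" into the flat arrays
for _ in range(N):
    flatB[py] += a * flatA[px]
    px += N                # step by row stride, no i*N + j recompute
    py += N

# Both forms produce the same column of results.
assert all(flatB[i * N + jy] == B[i][jy] for i in range(N))
```

A compiler that recognizes the multi-dimensional subscript as an
induction variable emits the second form; one that doesn't recomputes
the address each trip, which is where the extra instruction counts
come from.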

The biggest problem SPEC must deal with, in my view, is obtaining consensus
on adding any new programs to the suite.  From the original three or four
technical guys in a bar who could be trusted to do the right thing
technically, SPEC has grown into a rather large membership that I suspect
includes a fair number of professional standards bureaucrats who are mainly
there to protect the interests of their employers.  This suggests that
agreement to do more of essentially the same thing may be easier to reach
than agreement to do new orthogonal things.

**************************************************************************

SPEC - the NASA contractor in Austin, not the performance consortium that
invented the SPECmark - is developing a GaAs SPARC implementation
intended to run at 200 MHz with a 4-stage pipeline.  There are separate
32-bit instruction and data busses; this could be generalized to 64 bits
wide later.  The plan is to produce a module containing IU, FPU,
two MMU's for I and D, and eight 4Kx8 RAMs for the I and D caches.  The critical
path in the system turns out to be the 32-bit address adder in the integer
unit.

**************************************************************************

Metaflow - in San Diego, not to be confused with Multiflow or Metaware
or Metasoft -
is doing a superscalar CMOS SPARC implementation.  At 40 MHz, a peak rate of 160 MIPS yields
an effective average rate of 86 MIPS and a SPECmark of 86.
Each clock can issue three nonbranch instructions and one branch.
There are three integer ALU's, an FPU ALU, and an FP multiplier,
as well as a branch unit.  Instructions are issued and executed
out of order, but this is somehow patched up so that if a trap occurs,
you don't observe anything out of order.  Speculative execution allows
running past conditional branches. 
Instructions per clock is rated at 2.0-2.2, compared to 1.2 for IBM RIOS,
0.7 for MIPS R6000, and 0.6 for i860.
Performance estimates are based on some existing version of SPARCompilers;
they will eventually want to add their own optimization to replace
conditional branches with logical operations.  Many optimizations in
current compilers don't help much since they aren't oriented to the
peculiarities of a multiscalar implementation.
A "DCAF" content-addressable FIFO contains all out-of-order, speculative,
and register-renamed results; retirement from the DCAF is strictly in
order and non-speculative.  Traps are handled like mispredicted
branches.  There are about 150000 gates in the design that aren't RAM.
The register file has 9 read ports and 6 write ports.  Samples appear to
be promised for June 1991.
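A toy model of the in-order-retirement idea - my reading of the talk,
not Metaflow's actual DCAF design: instructions enter a FIFO in program
order, may complete in any order, but update architectural state only
from the head, so a trap can simply discard everything still queued.

```python
from collections import deque

def retirement_order(program, completion):
    """Retire instructions in program order despite out-of-order completion.

    program:    instruction names in program (issue) order
    completion: the same names in the order their results arrive
    Returns the order in which architectural state is updated.
    """
    fifo = deque(program)    # program-order FIFO, like the DCAF
    done = set()
    retired = []
    for insn in completion:              # results arrive out of order
        done.add(insn)
        while fifo and fifo[0] in done:  # retire only from the head
            retired.append(fifo.popleft())
    return retired

# i3 finishes first, then i1, then i2 - but retirement, and hence any
# state visible at a trap, is still strictly i1, i2, i3.
print(retirement_order(["i1", "i2", "i3"], ["i3", "i1", "i2"]))
```

Note that a result completed early (i3 above) sits in the queue until
everything older has retired; that waiting is the price of precise traps.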

**************************************************************************

MIPS R6000 - The MIPS system (6280) based on the 6000 ECL chip has been
shipping since March, but hasn't yet reached "FCS".  (But as you move from
desktops toward mainframes that concept gets fuzzier).  At 60 MHz the
SPECmark is 44.  There are up to six VME busses and a gigabyte of RAM.
There is a two level cache at 8ns and 15ns respectively.
The CPU chip, FPU chip, and bus controller chip are each good for about
90000 transistors and 20 watts.  Cycles per instruction is currently
running 1.4-1.9.

Some people have questioned the future of ECL in this business, especially
since DEC is alleged to have cancelled its ECL SPIM line because of not enough
parts from BIT.  George Taylor, the MIPS speaker, thinks that it makes sense
to continue because you can design with lower voltages (2 volts instead of 5)
and if you do that, ECL remains competitive with GaAs.  And all the other
architectural directions of CMOS can be applied to ECL - wider data paths,
multiscalar, integrated CPU+FPU, deeper pipes, etc.

**************************************************************************

i960 CA - is a 3X superscalar implementation that has been working for
about a year.  Relative to the same chip without superscalar mode,
a 25% performance improvement was achieved for 13% more cost.  The 960CA
is 1.6-2.8X a 960KA at the same clock, but note the CA has no floating-point
hardware.  Compilers need improvement (same as everybody else).
The speaker Steve McGeady noted the following points for improvement:
resource conflicts, short basic blocks, calling conventions, ignorance
about memory cache structure and latency.

i960 XX - the next generation - was announced by McGeady.  This
1.5-million-transistor implementation has CPU, FPU, and MMU all on board.
Floating-point +, -, and * are 3 cycles; / is 14.  20 MFLOPS DAXPY and
SAXPY are expected at 50 MHz.  The "front bus" is 32 bits wide and runs
at 44 MB/sec; the "back bus" is 64 bits wide and runs at 264 MB/sec.

**************************************************************************

Weitek - A. Quek disclosed the goals of Weitek's next project.  They
are hoping for 60 MHz from 0.8 micron CMOS with 360000 transistors,
including a 32x64 register array, generating 3 watts.  
This chip is currently in layout.

Everything except divide
and sqrt has 3-cycle latency; double-precision div and sqrt have
9- and 10-cycle latencies.
Only single and double precision floating point
are supported; underflow works like previous Weitek processors.  New
op codes include min and max.

The adder has interesting parallelism.  The true-add and true-subtract
cases are computed in parallel and the wrong answer thrown away at the
end.  Multiply is performed with a full array of carry save adders 
connected in an arithmetic progression scheme.  Division and sqrt
exploit the multiplier array to produce correctly-rounded results from
iterative starting points.
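The talk didn't spell out the algorithm beyond "iterative starting
points" on the multiplier array, but the standard technique is
Newton-Raphson refinement of a reciprocal seed, which needs only
multiplies and adds - exactly what a fast multiplier array provides.
A sketch under that assumption (the seed constants are the classic
linear approximation, not anything Weitek disclosed):

```python
import math

def nr_divide(a, b, iterations=4):
    """Approximate a/b by Newton-Raphson refinement of 1/b.

    b is scaled so its mantissa m lies in [0.5, 1), a linear seed
    approximates 1/m, and each step x' = x*(2 - m*x) roughly doubles
    the number of correct bits - using only multiplies and subtracts.
    (Real hardware adds a final correction to guarantee correct
    rounding; that step is omitted here.)
    """
    m, e = math.frexp(b)                 # b = m * 2**e, |m| in [0.5, 1)
    sign = math.copysign(1.0, m)
    m = abs(m)
    x = 48.0 / 17.0 - 32.0 / 17.0 * m    # linear seed for 1/m
    for _ in range(iterations):
        x = x * (2.0 - m * x)            # Newton step: only * and -
    return sign * a * x / 2.0 ** e

print(nr_divide(1.0, 3.0))
```

The seed is good to about 1/17 relative error, so four quadratically
converging iterations are ample for double precision; that is why div
and sqrt latencies land at a small multiple of the multiply latency.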

**************************************************************************

BIT believes in ECL, of course.  Greg Taylor introduced BIT's 3130,
which is targeted for 100 MHz in pipelined mode.  
It runs single and double precision arithmetic.  +, *, and div/sqrt
can run concurrently, but only one instruction can be initiated per cycle,
for a guaranteed speed limit of 200 MFLOPS.  (I guess I missed something
since that sentence is inconsistent).
There are 200000 transistors; the current process is 2.0 micron but
this design is scalable to 1.2 micron.

div and sqrt run at 200 MHz internally.  Overflow and underflow are
detected early to facilitate reporting precise exceptions.  Presumably
this is a little bit conservative - sometimes exceptions would be
false alarms since at the boundaries you can't be sure until the
final rounding.  (On a SPARC implementation that would be taken care
of automatically by the kernel floating-point trap handler.)

**************************************************************************

The Motorola 96002 is a single-precision signal processor that wouldn't
be too interesting to me except that it provides one of the few, if
not the only, implementations of IEEE single-extended precision to back
up the single precision computations.  It also can handle subnormal
operands and results correctly at the cost of an occasional extra cycle.

**************************************************************************

Intel 4860 - the worst of both worlds - as presented by Hauppauge Computer
Works.  That's my impression, not their title, for a description of
their systems that combine i486 and i860.  Evidently systems are forthcoming
that will combine i860 CPU's running Unix with an i486 for DOS Windows.
The compilers need work.  So did the hardware, according to the speakers,
but they didn't want to go into details in public.  
They did say that "i860 signal timing is not as tight as i486,
resulting in less margin and harder design."

**************************************************************************

Intel is also working in a number of directions toward massive parallelism.
There were presentations on the Touchstone and iWarp projects.


