High and Lowlights of Hot Chips IV

David Hough sun!Eng!David.Hough
Thu Aug 13 17:53:20 PDT 1992


Hot Chips IV was held August 10-11 at Stanford.  Some of the
presentations were less informative than I had hoped.  I hope that
the following doesn't misrepresent too much.  CPUs are discussed
last, for reasons that will become clear.

Low Power:  There were several presentations about issues and products
optimized for low power consumption in battery-operated systems.  If
these really take off, then they could become the true mass market for
computer chips, leaving x86/PC systems in second place and all RISC
systems together in a distant third.  If what they say is true, then
everything you have learned about instruction set architecture is
wrong, because modern RISC hardware design and optimization techniques
maximize speed rather than minimize power consumption - CISCy designs
have an advantage to the extent that they minimize code size and hence
memory requirements.  Maybe UCSD Pascal and P-code interpreters will
rise again.

The ARM610 is an example of such a modern design - 360,000 transistors,
< 125 mA power consumption at 25 MHz.  The AT&T CRISP/Hobbit is
another; it keeps the active part of the stack in the register file.
The SPARC90 "chipset on a chip" puts everything on one chip... except
the floating-point unit.  I think it's intended for the embedded
market.  The best talk in this group was about lowering voltages,
frequencies, and, in a sense, design complexity, all in order to reduce
power.  The idea is that for many specific applications, if you can go
fast enough, there is no added value in going faster.  The complexity
attack is to reduce the number of transistors active at once, for
instance by using ripple-carry instead of carry-select adders.  The
frequency attack is to do more things pipelined or in parallel so that
a lower clock frequency suffices.  The claim is that the optimal
voltage is probably 1.0-1.5 volts.
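
None of the following arithmetic was in the talk, but the standard CMOS
dynamic power relation P ~ C * V^2 * f is the usual basis for such
claims.  The C sketch below uses made-up illustrative numbers (a 5 V /
40 MHz baseline versus a 1.5 V / 20 MHz low-power design) just to show
why dropping the supply voltage buys so much, even if the clock has to
slow down.

	#include <stdio.h>

	/* Rough sketch: CMOS dynamic power scales as C * V^2 * f.
	   Capacitance is taken as 1.0 in arbitrary units; the voltage
	   and frequency figures are illustrative, not from the talk. */
	static double dynamic_power(double c, double volts, double mhz)
	{
	    return c * volts * volts * mhz;
	}

	int main(void)
	{
	    double base = dynamic_power(1.0, 5.0, 40.0);  /* conventional design */
	    double low  = dynamic_power(1.0, 1.5, 20.0);  /* low-voltage, slower */

	    /* Even at half the clock rate, the low-voltage design uses
	       roughly (1.5/5.0)^2 / 2 = 4.5% of the baseline power. */
	    printf("relative power: %.3f\n", low / base);
	    return 0;
	}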

Memory:  An evening session was devoted to discussing various methods
of obtaining 1 GB/s of bandwidth to DRAM memory systems.  Four methods
were discussed:  Rambus, RamLink, cache DRAM, and synchronous DRAM.
All are under development, but I don't know enough about this area to
guess which will prove most effective.

CM5 Vector Coprocessor:  Each CM5 processing node has a SPARC
microprocessor plus 2-8 vector units, which also act as memory
controllers.  Each TI chip contains two vector units and a memory
controller.  TI created the vector processor to provide maximum
sustained MFLOPS/dollar on vector problems.  Existing technologies
were used to bring the hardware to market quickly, perhaps in order to
allow enough time for optimized compilers and libraries to be created.
There are multiply-add instructions, and vector instructions come in
two sizes, 32 and 64 bits.  Stats are one million transistors, a 40 MHz
clock and hence an 80 MFLOPS speed limit (counting a multiply-add as
two flops), 5 watts, and 256 32-bit registers.  Evidently the software
and system-level integration aren't complete, since no SPEC numbers
were given; 50-60 MFLOPS are obtained in hand-coded kernels.  The
interface is memory-mapped rather than coprocessor style.

Fujitsu MB92831 Vector Coprocessor:  This is a vector processor that
can execute double-precision Linpack at 43 MFLOPS at 70 MHz from
Fortran source, although the part is currently being used at 50 MHz.
It's intended to be used with Fujitsu's SPARClite embedded processor,
for which software support is available, although it can be used with
any CPU by anybody willing to roll their own software support.  It's
1.5 million transistors and runs on 3.3 volts, for 4.5 watts at 50 MHz.
There is no concept of "context switch"; thus a system that wants to
run n accelerated vector processes at once requires n vector units.
The 32 scalar registers are 32 bits wide; there are 2000 32-bit
registers for vector elements.  The part will be used in the Meiko CS-2
massively parallel system.  Fujitsu expects it to succeed the i860
family in attached array processor systems.
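
As background (not from the talk): the Linpack benchmark spends nearly
all of its time in the DAXPY kernel, so sustaining 43 MFLOPS at 70 MHz
from Fortran source amounts to streaming about 0.6 flops per cycle
through a loop like the C sketch below; this is just the shape of the
kernel, not Fujitsu's code.

	/* DAXPY, y := a*x + y: the inner loop that dominates Linpack.
	   Each iteration is one multiply and one add, i.e. two flops,
	   exactly the operation a vector multiply-add unit is built
	   to stream from memory. */
	void daxpy(int n, double a, const double *x, double *y)
	{
	    int i;
	    for (i = 0; i < n; i++)
	        y[i] += a * x[i];
	}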

Alpha:  There wasn't much new information.  Some things that seemed
new to me may simply be things I missed the first time around.
Apparently there are no "trap enable" bits in the floating-point
control register.  Presumably if you want to enable traps on, e.g.,
division by zero, you have to declare that at compile time.  The only
mode bits are the dynamic rounding modes, although static rounding
modes are also available as part of the fpop codes.  In the initial
implementation, there is a 43 amp peak current draw in the clock
driver, which makes up a significant fraction of this very fast chip.

Supersnake:  The HP 7100 superscalar PA-RISC chip achieves SPECint92 >
70 and SPECfp92 > 130 at 100 MHz.  9 ns SRAMs are required.  Evidently
there will be upgrade paths available from existing systems that use
the current memory chips, since such compatibility was mentioned as a
design goal.  On each cycle an integer op and an fpop can be
dispatched, and the fpop can be one of PA-RISC's unusual 5-operand
multiply-adds, so the speed limit is 200 MFLOPS.  There is hardware
support for reducing cache miss and TLB miss penalties.  There is also
hardware support for cheap two-CPU MPs, as well as more complicated
support for more complicated MPs.  There is apparently room for quite a
bit more compiler development to fully exploit all the hardware
features.

Tsunami:  I think this was the first public presentation of Sun's next
low-end CPU, but it answered none of the questions that might have been
on people's minds.  Instead it talked about exploiting existing
technology, expertise, and CAD tools to bring the project to volume
production in minimal time.  Tsunami supports a 32-bit SBus rather than
a 64-bit one.  In response to a question, the speaker indicated that
that was where the mass market was expected to be.  For more details
about the chips, contact Texas Instruments, the manufacturer.

hyperSPARC:  Cypress/Ross intends hyperSPARC modules to be
high-performance plug-compatible replacements for existing MBus module
CPUs.  Each module contains one CPU/FPU chip, one memory controller
chip, and one or two cache SRAMs.  The CPU is two-way superscalar with
a number of minor optimizations relative to earlier designs:  fpops can
be queued after issue and before launch to avoid stalling integer ops;
the sethi used to set a 32-bit constant or to load from an arbitrary
address may have zero execution time; a condition-setting integer op
and a subsequent conditional branch may be launched simultaneously.
The floating-point performance goal is 2X a SPARCstation 2.  100 MHz
clock rates are intended to be reached eventually.

P5:  The previous Ross speaker was somewhat perfunctorily introduced
by the Intel session chair, who apparently felt that the P5
presentation was the highlight of Hot Chips IV.  The last two
presentations covered the integer and floating-point parts of the
chip.  Indeed, anticipation was intensified by Intel's not handing out
copies of their slides until right before their session - everybody
else's were ready prior to the conference.  The anticipation was
unwarranted, however, since neither the printed slides nor the
presentations answered any of the key questions.  Instead both Intel
speakers were required to read statements from their lawyers to the
effect that these parts were unannounced and subject to pending and
future patent applications, lending credence to the usenet joke that
Intel decided to delay the P5 in order to hire more lawyers.  A
re-reading of the same statement, by the second speaker, was not warmly
received.

These nuggets were disclosed:  3 million transistors, the process is
0.8 um, correctly-predicted branches execute in zero time, cache
consistency is maintained in hardware, instruction and data caches are
separate, and there are two almost-identical pipelines that support
two-way superscalar execution when (as sketched below):

	* both instructions are "simple" (alu/move, reg-reg, imm-reg,
	mem-reg, reg-mem, branches)
	* the destination of the first instruction is not a source or
	destination of the second instruction
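
As an illustration only (Intel disclosed no code, and this is at the C
level rather than x86 assembly, with made-up function and variable
names), the dependence condition amounts to the difference between the
two statement pairs in this sketch:

	/* Hypothetical illustration of the P5 pairing rule. */
	void pairing_example(int *a, int *b, int *c, int *d)
	{
	    /* Independent simple operations: the two updates touch
	       disjoint values, so they could issue in the same cycle,
	       one down each pipeline. */
	    *a = *c + 1;
	    *b = *d + 1;

	    /* Dependent operations: the second statement reads the
	       result of the first, violating the destination/source
	       condition, so the two must issue in successive cycles. */
	    *a = *c + 1;
	    *b = *a + 1;
	}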

The on-chip data cache supports access from both pipelines.  Remarkably,
P5 is fully upward compatible with the 486.  Nothing was said about
downward compatibility.

Compilers are to be forthcoming from all major vendors for all major
operating systems.   Optimizing compilers should

	* use simple instruction formats to maximize dual issue 
	* schedule instructions to minimize address generation conflicts
	* do register allocation and instruction scheduling together to
	make the best use of the small CPU register set

The performance goal of the floating-point unit is 4-5X a 33 MHz 486 on
"scalar" code and 6-10X on "vector" code.  The FPU has three arithmetic
units and an eight-stage pipeline with one-cycle throughput; the adder
and multiplier actually have three-cycle latency.  Fpops execute from
only one of the two CPU pipelines, so the MFLOPS speed limit will be 1X
the internal clock rate.  The design was tuned for double-precision
memory-register operations, since those are the most common in
extended-precision-based code.
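
A sketch of what that latency/throughput combination implies for
compilers (my illustration, not Intel's code): a reduction with a
single accumulator can only retire one add every three cycles, while
unrolling with three independent partial sums - one of the
optimizations mentioned later - can keep the one-per-cycle pipeline
full.

	/* Naive loop: each add waits the full three-cycle latency
	   for the previous one, because of the serial dependence
	   on the single accumulator s. */
	double sum_naive(const double *x, int n)
	{
	    double s = 0.0;
	    int i;
	    for (i = 0; i < n; i++)
	        s += x[i];
	    return s;
	}

	/* Unrolled loop: three independent accumulators cover a
	   three-cycle adder latency, so one add can issue per cycle. */
	double sum_unrolled(const double *x, int n)
	{
	    double s0 = 0.0, s1 = 0.0, s2 = 0.0;
	    int i;
	    for (i = 0; i + 2 < n; i += 3) {
	        s0 += x[i];
	        s1 += x[i + 1];
	        s2 += x[i + 2];
	    }
	    for (; i < n; i++)        /* remainder iterations */
	        s0 += x[i];
	    return s0 + s1 + s2;
	}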

Arithmetic exceptions are detected early, in a manner which (in response
to a question) was stated not to impinge on MIPS' patent in this area.
In order to generate precise exceptions, instructions which are
potentially exceptional are not pipelined.  But note that, given the
design emphasis on (double-precision mem) op (extended reg) ->
(extended reg), exceptions such as underflow and overflow are even
rarer than usual.  Since FXCH is often generated in stack-oriented
code, FXCH was designed to be zero cost in many cases.

Transcendental functions have been reworked to be faster and more
accurate, to better than 1 ulp, in order to be 4-6X faster than a
33 MHz 486 (but nobody ever said what the clock rate of the P5 would
be).  This also means that the results will be slightly different from
the 486 on rare occasions, just as the 486/387 was slightly different
from the 8087.

Appropriate compiler optimizations include loop unrolling and
exploiting the parallel FXCH capability.


