DEC ALPHA description

Tue Feb 25 10:47:55 PST 1992

Note no D format, no integer divide, double precision only,
no sqrt, no condition code register,
no multiplier extension register, imprecise traps, static mode bits ...
A mixture of good and bad ideas, but definitely significant.
 From comp.arch:

   ALPHA ARCHITECTURE TECHNICAL SUMMARY 
   Dick Sites, Rich Witek

   [NOTE: "Alpha" is an internal code name. An official name will be announced
    soon.]

   WHAT IS ALPHA?

   Alpha is a 64-bit RISC architecture, designed with particular emphasis on 
   speed, multiple instruction issue, multiple processors, software migration 
   from VAX VMS and MIPS ULTRIX, and long lifetime. The architects rejected 
   any feature that did not appear to be usable for at least 25 years.

   The first chip implementation runs at up to 200 MHz.  The speed of Alpha 
   implementations is expected to scale up from this by at least a factor of 
   1000 over the next 25 years. 

   FORMATS

   Data Formats

   Alpha is a load/store RISC architecture with all operations done between
   registers. Alpha has 32 integer registers and 32 floating registers, each
   64 bits. Integer register R31 and floating register F31 are always zero.
   Longword (32-bit) and quadword (64-bit) integers are supported. Four
   floating datatypes are supported: VAX F-float, VAX G-float, IEEE single
   (32-bit), and IEEE double (64-bit). Memory is accessed via 64-bit virtual
   little-endian byte addresses. 

   Instruction Formats

   Alpha instructions are all 32 bits, in four different instruction formats
   specifying 0, 1, 2, or 3 register fields. All formats have a 6-bit opcode. 

           +-----+-------------------------+
           | OP  |         number          | PALcall
           +-----+----+--------------------+
           | OP  | RA |        disp        | Branch
           +-----+----+----+---------------+
           | OP  | RA | RB |     disp      | Memory
           +-----+----+----+----------+----+
           | OP  | RA | RB |  func.   | RC | Operate
           +-----+----+----+----------+----+

   PALcalls specify one of a few dozen complex operations to be performed.

   Conditional branches test register RA and specify a signed 21-bit
   PC-relative longword target displacement. Subroutine calls put the return
   address in RA. 

   Loads and stores move longwords or quadwords between RA and memory, using 
   RB plus a signed 16-bit displacement as the memory address.

   Operates use source registers RA and RB, writing result register RC. There 
   is an extended opcode in the 11-bit function field. Integer operates can use 
   the RB field and part of the function field to specify an 8-bit 
   zero-extended literal.
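
The four formats can be illustrated with a small field extractor. This is a minimal sketch, not taken from the summary: it assumes the fields are packed from the most significant bit in the order drawn in the diagram, and that register fields are 5 bits wide (32 registers). The widths of the opcode, displacement, and function fields come from the text; the helper names are hypothetical.

```c
#include <stdint.h>

/* Sign-extend the low `bits` bits of v (bits <= 21 here, so no overflow). */
static int32_t sext(uint32_t v, unsigned bits) {
    uint32_t sign = 1u << (bits - 1);
    v &= (sign << 1) - 1;                    /* keep only the field */
    return (int32_t)(v ^ sign) - (int32_t)sign;
}

/* Common fields. Positions are an assumption based on the diagram order. */
static unsigned insn_opcode(uint32_t insn) { return insn >> 26; }          /* 6 bits */
static unsigned insn_ra(uint32_t insn)     { return (insn >> 21) & 0x1F; } /* 5 bits */
static unsigned insn_rb(uint32_t insn)     { return (insn >> 16) & 0x1F; } /* 5 bits */

/* PALcall format: everything below the opcode is the function number. */
static uint32_t pal_number(uint32_t insn)  { return insn & 0x03FFFFFF; }

/* Branch format: signed 21-bit displacement, counted in longwords. */
static int32_t branch_disp(uint32_t insn)  { return sext(insn, 21); }

/* Memory format: signed 16-bit byte displacement added to RB. */
static int32_t mem_disp(uint32_t insn)     { return sext(insn, 16); }

/* Operate format: 11-bit function field, then the RC result register. */
static unsigned op_func(uint32_t insn)     { return (insn >> 5) & 0x7FF; }
static unsigned op_rc(uint32_t insn)       { return insn & 0x1F; }
```

Because every format keeps the opcode in the same place, a decoder can pick the format from the top 6 bits before looking at anything else.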

   INSTRUCTIONS

   PALcall Instructions

   The Privileged Architecture Library call instructions specify one of a few
   dozen complex functions to be performed. These functions deal with
   interrupts and exceptions, task switching, virtual memory, and other
   complex operations that must be done atomically. PALcall instructions
   vector to a privileged library of software subroutines (using the same Alpha 
   instruction set) that implement an operating-system-specific set of these 
   complex operations. 

   Branch Instructions

   Conditional branch instructions can test a register for positive/negative
   or for zero/nonzero. They can also test integer registers for even/odd. 
   Unconditional branch instructions can write a return address into a 
   register. There is also a calculated jump instruction that branches to an 
   arbitrary 64-bit address in a register.
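
The register tests and the target calculation can be sketched as follows. The condition names are hypothetical, and the sketch assumes the longword displacement is applied relative to the address of the instruction following the branch, which the summary does not state explicitly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the kinds of test the text describes. */
typedef enum {
    COND_ZERO, COND_NONZERO,
    COND_NEGATIVE, COND_POSITIVE,
    COND_EVEN, COND_ODD
} branch_cond;

/* Decide a conditional branch purely from the 64-bit value in RA. */
static bool branch_taken(branch_cond cond, uint64_t ra) {
    bool negative = (ra >> 63) != 0;     /* sign bit of the register */
    switch (cond) {
    case COND_ZERO:     return ra == 0;
    case COND_NONZERO:  return ra != 0;
    case COND_NEGATIVE: return negative;
    case COND_POSITIVE: return !negative && ra != 0;
    case COND_EVEN:     return (ra & 1) == 0;   /* low bit clear */
    case COND_ODD:      return (ra & 1) != 0;   /* low bit set */
    }
    return false;
}

/* Target = next instruction + 4 * signed 21-bit longword displacement.
 * Assumption: the base is pc + 4, not the branch's own address. */
static uint64_t branch_target(uint64_t pc, int32_t disp21) {
    return pc + 4 + (uint64_t)((int64_t)disp21 * 4);
}
```

With a 21-bit signed longword displacement, a branch under these assumptions reaches roughly 4 Mbytes in either direction.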

   Load/Store Instructions

   Load and store instructions can move either 32- or 64-bit aligned
   quantities. The VAX floating-point load/store instructions swap words to
   give a consistent register format for floats. Memory addresses are flat
   64-bit virtual addresses, with no segmentation. A 32-bit integer datum is
   placed in a register in a canonical form that makes 33 copies of the high
   bit of the datum. A 32-bit floating datum is placed in a register in a
   canonical form that extends the exponent by 3 bits and extends the fraction
   with 29 low-order zeros. 32-bit operates preserve these canonical forms. 
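
The two canonical forms can be sketched in C. This is an illustration under stated assumptions: the function names are hypothetical, the 32-bit float is taken to be IEEE single, and the 3-bit exponent extension is assumed to re-bias the exponent the way a single-to-double widening does. Denormals are passed through without normalization, so only zeros, normal numbers, and infinities are expected to match a true conversion to double.

```c
#include <stdint.h>

/* 32-bit integer load: bit 31 is replicated through bits 63..32,
 * giving 33 copies of the high bit in the register. */
static uint64_t canon_longword(uint32_t mem) {
    return ((uint64_t)mem ^ 0x80000000ULL) - 0x80000000ULL;
}

/* 32-bit IEEE single load: sign stays on top, the 8-bit exponent
 * becomes 11 bits, and the 23-bit fraction gets 29 low-order zeros. */
static uint64_t canon_single(uint32_t mem) {
    uint64_t sign = (uint64_t)(mem >> 31) << 63;
    uint32_t exp8 = (mem >> 23) & 0xFF;
    uint64_t frac = (uint64_t)(mem & 0x7FFFFF) << 29;
    uint64_t exp11;

    if (exp8 == 0)
        exp11 = 0;                        /* zero (or denormal) stays at 0 */
    else if (exp8 == 0xFF)
        exp11 = 0x7FF;                    /* infinity/NaN stays all ones */
    else
        exp11 = (uint64_t)exp8 + (1023 - 127);   /* assumed re-bias */

    return sign | (exp11 << 52) | frac;
}
```

For example, the single-precision pattern for 1.5 (0x3FC00000) becomes 0x3FF8000000000000, which is also the double-precision pattern for 1.5.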

   There are no 8- or 16-bit load/store instructions, but there are facilities 
   for doing byte manipulation in registers.

   Alpha has no 32/64 mode bit or other such device. Compilers, as directed by 
   user declarations, can generate any mixture of 32- and 64-bit operations.

   Integer Operate Instructions

   The integer operate instructions manipulate full 64-bit values, and include
   the usual assortment of arithmetic, compare, logical, and shift
   instructions. There are just three 32-bit integer operates: add, subtract,
   and multiply. These differ from their 64-bit counterparts ONLY in overflow
   detection and in producing 32-bit canonical results. 
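
A rough model of the three 32-bit operates, assuming the non-trapping variants simply wrap: compute as for 64 bits, keep the low 32 bits, and put the result back into the canonical sign-extended form. Registers are modeled as uint64_t images and the function names are hypothetical.

```c
#include <stdint.h>

/* Return the canonical 32-bit form: low longword, sign-extended to 64 bits. */
static uint64_t canon32(uint64_t x) {
    uint64_t low = x & 0xFFFFFFFFULL;
    return (low ^ 0x80000000ULL) - 0x80000000ULL;
}

/* 32-bit add, subtract, multiply: same arithmetic as the 64-bit
 * versions, but the result is forced into canonical 32-bit form. */
static uint64_t addl(uint64_t a, uint64_t b) { return canon32(a + b); }
static uint64_t subl(uint64_t a, uint64_t b) { return canon32(a - b); }
static uint64_t mull(uint64_t a, uint64_t b) { return canon32(a * b); }
```

Since the low 32 bits of a sum, difference, or product depend only on the low 32 bits of the operands, feeding canonical inputs to these operations always yields canonical outputs.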

   There is no integer divide instruction.

   In addition to the operations found in conventional RISC architectures,
   there are scaled add/subtract for quick subscript calculation, 128-bit
   multiply for division by a constant and multiprecision arithmetic,
   conditional moves for avoiding branches, and an extensive set of
   in-register byte manipulation instructions for avoiding single-byte writes.
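
Two of these operations are easy to model in portable C. The sketch below uses hypothetical function names: s8addq stands for a scaled add with a scale of 8 (one quadword per element), and umulh returns the high 64 bits of the 128-bit unsigned product. The divide-by-10 constant is the standard reciprocal for 64-bit unsigned division and is not taken from the summary.

```c
#include <stdint.h>

/* Scaled add: address of element `index` in an array of quadwords. */
static uint64_t s8addq(uint64_t index, uint64_t base) {
    return index * 8 + base;
}

/* High 64 bits of the unsigned 128-bit product a * b,
 * built from four 32x32 -> 64 partial products. */
static uint64_t umulh(uint64_t a, uint64_t b) {
    uint64_t a_lo = a & 0xFFFFFFFFULL, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFULL, b_hi = b >> 32;
    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;
    /* Middle column; cannot overflow 64 bits. */
    uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFFULL) + lo_hi;
    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

/* Division by the constant 10 without a divide instruction:
 * multiply by ceil(2^67 / 10), keep the high half, shift right by 3. */
static uint64_t div10(uint64_t n) {
    return umulh(n, 0xCCCCCCCCCCCCCCCDULL) >> 3;
}
```

A compiler for a machine without integer divide can apply the same pattern to any constant divisor by choosing a suitable multiplier and shift.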

   Rather than keeping a global state bit for integer overflow trap enable,
   the enable is encoded in the function field of each instruction. Thus, both
   ADDQ/V and ADDQ opcodes exist for specifying 64-bit add with and without
   overflow checking. This makes pipelined implementations easier.
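
The difference between the two variants can be modeled as two separate functions instead of one function that consults a global mode flag. This is only a sketch: the overflow indication is returned as a flag here, whereas the hardware would raise an arithmetic trap, and the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t value;     /* wrapped 64-bit sum */
    bool     overflow;  /* true where the checking variant would trap */
} add_result;

/* Model of the non-checking add: the result silently wraps. */
static uint64_t addq(uint64_t a, uint64_t b) {
    return a + b;
}

/* Model of the checking add: same sum, plus signed-overflow detection.
 * Overflow occurs when both operands differ in sign from the sum. */
static add_result addq_v(uint64_t a, uint64_t b) {
    uint64_t sum = a + b;
    add_result r = { sum, (((a ^ sum) & (b ^ sum)) >> 63) != 0 };
    return r;
}
```

Because the choice is made per instruction, checked and unchecked adds can sit side by side in the pipeline with no mode switch between them.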

   Floating-point Operate Instructions

   The floating operate instructions include four complete sets of VAX and
   IEEE arithmetic, plus conversions between float and integer. 

   There is no floating square root instruction.

   In addition to the operations found in conventional RISC architectures, 
   there are conditional moves for avoiding branches, and merge sign/exponent 
   instructions for simple field manipulation.

   Rather than keeping global state bits for arithmetic trap enables and
   rounding mode, these enable and mode bits are encoded in the function field
   of each instruction. 

   SIGNIFICANT DIFFERENCES BETWEEN ALPHA AND CONVENTIONAL RISC PROCESSORS

   First, Alpha is a true 64-bit architecture, with a minimal number of 32-bit 
   instructions. It is not a 32-bit architecture that was later expanded to 64
   bits. 

   Second, Alpha was designed to allow very high-speed implementations. The
   instructions are very simple (no load-four-registers-unaligned-and-check-
   for-bytes-of-zero). There are no special registers that would prevent
   pipelining multiple instances of the same operations (no MQ register and no
   condition codes). The instructions interact with each other ONLY by one
   instruction writing a register or memory, and another one reading from the
   same place. This makes it particularly easy to build implementations that
   issue multiple instructions every CPU cycle. (The first implementation
   in fact issues two instructions every cycle.) There are no
   implementation-specific pipeline timing hazards, no load-delay slots, and
   no branch-delay slots. These features would make it difficult to maintain
   binary compatibility across multiple implementations and difficult to
   maintain full speed on multiple-issue implementations. 

   Alpha is unconventional in the approach to byte manipulation. Single-byte
   stores found in conventional RISC architectures force cache and memory
   implementations to include byte shift-and-mask logic, and sequencer logic
   to perform read-modify-write on memory words. This approach is awkward to
   implement quickly, and tends to slow down cache access to normal 32- or
   64-bit aligned quantities. It also makes it awkward to build a high-speed
   error-correcting write-back cache, which is often needed to keep a very
   fast RISC implementation busy. It also can make it difficult to pipeline
   multiple byte operations. 

   Instead, the byte shifting and masking is done in Alpha with normal 64-bit
   register-to-register instructions, crafted to keep the sequences short.
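
A byte store done this way amounts to: load the enclosing quadword, clear the target byte, merge in the new byte, and store the quadword back. The sketch below models memory as an array of little-endian quadwords. The helper names are hypothetical and stand in for whatever extract, insert, and mask instructions an implementation provides.

```c
#include <stdint.h>

/* Bit offset of the addressed byte within its quadword (little-endian). */
static unsigned byte_shift(uint64_t addr) {
    return 8u * (unsigned)(addr & 7);
}

/* Extract: pull the addressed byte down to bits 7..0. */
static uint64_t extract_byte(uint64_t quad, uint64_t addr) {
    return (quad >> byte_shift(addr)) & 0xFF;
}

/* Insert: move a byte value up into the addressed byte lane. */
static uint64_t insert_byte(uint64_t value, uint64_t addr) {
    return (value & 0xFF) << byte_shift(addr);
}

/* Mask: clear the addressed byte lane, keep the other seven. */
static uint64_t mask_byte(uint64_t quad, uint64_t addr) {
    return quad & ~((uint64_t)0xFF << byte_shift(addr));
}

/* Byte load: one aligned quadword load plus an extract. */
static uint8_t load_byte(const uint64_t *mem, uint64_t addr) {
    return (uint8_t)extract_byte(mem[addr >> 3], addr);
}

/* Byte store: quadword load, mask, insert, quadword store. */
static void store_byte(uint64_t *mem, uint64_t addr, uint8_t value) {
    uint64_t quad = mem[addr >> 3];
    mem[addr >> 3] = mask_byte(quad, addr) | insert_byte(value, addr);
}
```

Note that the store sequence is a read-modify-write of the whole quadword, so two processors updating different bytes of the same quadword need the interlocked sequence described later.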

   Alpha is also unconventional in the approach to arithmetic traps. In
   contrast to conventional RISC architectures, Alpha arithmetic traps
   (overflow, underflow, etc.) are imprecise -- they can be delivered an
   arbitrary number of instructions after the instruction that triggered the
   trap, and traps from many different instructions can be reported at once.
   This makes implementations that use pipelining and multiple issue
   substantially easier to build. 

   If precise arithmetic exceptions are desired, trap barrier instructions can
   be explicitly inserted in the program to force traps to be delivered at
   specific points. 

   Alpha is also unconventional in the approach to multiprocessor shared
   memory. As viewed from a second processor (including an I/O device), a 
   sequence of reads and writes issued by one processor may be arbitrarily 
   reordered by an implementation. This allows implementations to use 
   multi-bank caches, bypassed write buffers, write merging, pipelined writes 
   with retry on error, etc. If strict ordering between two accesses must be
   maintained, memory barrier instructions can be explicitly inserted in the
   program. 
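
The usual place where ordering matters is handing data from one processor to another through a flag. The sketch below uses C11 fences as a stand-in for explicit memory barrier instructions. It is an analogy, not Alpha code, and the variable and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static uint64_t    payload;   /* ordinary data being handed over */
static atomic_bool ready;     /* flag the consumer polls; starts false */

/* Producer: the barrier keeps the payload write from being
 * reordered after the flag write as seen by another processor. */
static void publish(uint64_t value) {
    payload = value;
    atomic_thread_fence(memory_order_release);          /* barrier */
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Consumer: the barrier keeps the payload read from being
 * satisfied before the flag read. */
static bool try_consume(uint64_t *out) {
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);          /* barrier */
    *out = payload;
    return true;
}
```

Without the two barriers, a machine with the reordering freedom described above could let the consumer see the flag set and still read a stale payload.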

   The basic multiprocessor interlocking primitive is a RISC-style
   load_locked, modify, store_conditional sequence. If the sequence runs
   without interrupt, exception, or an interfering write from another
   processor, then the conditional store succeeds. Otherwise, the store fails
   and the program eventually must branch back and retry the sequence. This
   style of interlocking scales well with very fast caches, and makes Alpha an
   especially attractive architecture for building multiple-processor systems.
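
The retry structure of such a sequence can be sketched with C11 atomics. Portable C has no load_locked/store_conditional, so a weak compare-and-exchange is used here as a stand-in: like a conditional store, it may fail and force another trip around the loop. The function name is hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically add one to *counter and return the new value. */
static uint64_t locked_increment(_Atomic uint64_t *counter) {
    /* "load_locked": read the current value. */
    uint64_t seen = atomic_load_explicit(counter, memory_order_relaxed);

    /* "modify" then "store_conditional": try to install seen + 1.
     * On failure, `seen` is refreshed with the current value and
     * the loop branches back to retry. */
    while (!atomic_compare_exchange_weak_explicit(
               counter, &seen, seen + 1,
               memory_order_acq_rel, memory_order_relaxed)) {
        /* retry with the updated `seen` */
    }
    return seen + 1;
}
```

The stand-in is not an exact match: a compare-and-exchange succeeds whenever the value is unchanged, while the sequence described above fails on any interfering write, interrupt, or exception.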

   Alpha includes a number of HINTS for implementations, all aimed at allowing 
   higher speed. Calculated jumps have a target hint that can allow much 
   faster subroutine calls and returns. There are prefetching hints for the 
   memory system that can allow much higher cache hit rates. There are also
   granularity hints for the virtual-address mapping that can allow much more 
   effective use of translation lookaside buffers for big contiguous 
   structures.

   Alpha includes a very flexible privileged library of software for operating-
   system-specific operations, invoked with PALcalls. This library allows Alpha
   to run full VMS using one version of this software library that mirrors many
   of the VAX operating-system features, and to run OSF/1 using a different
   version that mirrors many of the MIPS operating-system features, and
   similarly for NT. Other versions could be tailored for real-time, teaching,
   etc. The PALcalls allow Alpha to run VMS with hardly more hardware than
   a conventional RISC machine has (the PAL mode bit itself, plus 4 extra
   protection bits in each TB entry). This library makes Alpha an especially
   attractive architecture for multiple operating systems. 

   Finally, Alpha is not strongly biased toward only one or two programming 
   languages. It is an attractive architecture for compiling at least a dozen 
   different languages.

   SUMMARY

   Alpha is designed to be a leadership 64-bit architecture.

   --------------------
       Specifications (150MHz version).

       Process Technology          .75 micron CMOS 

       Cycle Time                   150 MHz (6.6 ns)

       Die Size                     13.9mm x 16.8mm

       Transistor Count             1.68 million

       Package                      431 pin PGA

       Number of Signal Pins        291

       Power Dissipation            23 W at 6.6 ns cycle

       Power Supply                 3.3 volts

       Clocking Input               300 MHz differential 

       On-chip D-cache              8 Kbyte, physical, direct-mapped,
				    write-through, 32-byte line, 32-byte fill

       On-chip I-cache              8 Kbyte, physical, direct-mapped,
				    32-byte line, 32-byte fill, 64 ASNs

       On-chip DTB                  32-entry; fully-associative; 8-Kbyte,
				    64-Kbyte, 256-Kbyte, 4-Mbyte page sizes

       On-chip ITB                  8-entry, fully associative, 8-Kbyte page
				    plus 4-entry, fully-associative, 4-Mbyte page

       Floating Point Unit          On-chip FPU supports both IEEE and VAX
				    floating point

       Bus                          Separate data and address bus.
				    128-bit/64-bit data bus

       Serial ROM Interface         Allows the chip to directly
				    access serial ROM

       Virtual Address Size         64 bits checked; 43 bits
				    implemented

       Physical Address Size        34 bits implemented

       Page Size                    8 Kbytes

       Issue Rate                   2 instructions per cycle to A-box,
				    E-box, or F-box

       Integer Pipeline             7-stage pipeline

       Floating Pipeline            10-stage pipeline