preprints on division/sqrt

David G. Hough at validgh dgh
Tue Dec 20 22:47:38 PST 1994


From: melarocky.cs.cornell.edu (Miriam Leeser)
Subject: Re: Speed of FP divide hardware
Message-ID: <MEL.94Dec19162343arocky.cs.cornell.edu>
Date: 19 Dec 94 21:23:43 GMT
References: <3cth1m$g4fanntp.Stanford.EDU>
Sender: melacs.cornell.edu (Miriam Leeser)
Organization: Cornell Univ. CS Dept, Ithaca NY 14853
Lines: 56
In-Reply-To: obermanaumunhum.stanford.edu's message of 17 Dec 1994 02:05:42 GMT

Joe Keane (jgkanetcom.com) wrote:
> 
> Besides all that, i think that FDIV is just too slow.  Can't Intel come up
> with a better divide algorithm than one that gives only two bits at a time?

There are two principal ways of doing division in current
microprocessors.  Subtractive algorithms, such as radix-4 SRT, are
used in the Pentium, HP 7200, Mips R4000, and PowerPC 601, among
others.  Multiplicative algorithms, such as the Newton-Raphson method
and Goldschmidt's algorithm, are used in the IBM RS/6000 series and
recent SPARC implementations from Texas Instruments.

Multiplicative implementations make use of the floating-point
multiplier, which means that additional area requirements can be kept
small, but also that the multiplier is unavailable while division is
in progress.  Improving performance means accommodating lookup
tables which grow at an exponential rate with the increase in initial
guess accuracy.  There is also the problem of compliance with IEEE
last-bit accuracy alluded to in previous posts.  Claimed latencies
range from 7 to 19 machine cycles for double precision accuracy.


Efficient subtractive implementations require dedicated hardware; the
area expense is greater than the slowest multiplicative
implementations, but considerably smaller than the fastest ones.  The
latencies also tend to be longer; however, subtractive dividers have
the advantage of operating in parallel with the rest of the FPU.
There are other twists; the required hardware is sufficiently simple
that in some cases the divider can be run at twice the clock rate of
the rest of the chip.  For example, the radix-4 implementation in the
HP PA7200 actually produces 4 rather than 2 quotient bits per machine
cycle.  Subtractive division latencies range from 15 to 61 machine
cycles, with the median close to 30.

How all of these different factors play themselves out in real
implementations is a complex matter.  I been working on these problems
and the closely related issue of square root implementation with
Prof. Miriam Leeser.  Copies of our technical report "Area and
Performance Tradeoffs in Floating-Point Division and Square Root
Implementations" are available by anonymous ftp from

tesla.ee.cornell.edu

in the file

/pub/floating-point/div-sqrt-survey.ps.Z

This document surveys subtractive and multiplicative algorithms and
the performance and area tradeoffs that affect their implementation in
microprocessor FPU's.  There is another report, "An Area/Performance
Comparison of Subtractive and Multiplicative Divide/Square Root
Implementations" which presents the same results but omits much of the
background information.  It is located in div-sqrt-tradeoffs.ps.Z in
the same directory.  Look at /pub/floating-point/README for more
information.

I hope that some of you will find these documents useful.  I welcome
any comments you may have on my work.

Thank you,

Peter Soderquist
pgsaee.cornell.edu




More information about the Numeric-interest mailing list