epsilon

David Hough sun!Eng!David.Hough
Mon Apr 22 08:56:24 PDT 1991


A posting about machine epsilon from Nick Higham was part of the latest 
na-net digest, followed by comments by Cleve Moler.  Both are reproduced
below.  There's one point that needs further correction:

>MATLAB on a PC has IEEE floating point with extended precision implemented
>in an Intel chip.  The C compiler generates code with double rounding.
>MATLAB on a Sun Sparc also has IEEE floating point with extended precision,
>but it is implemented in a Sparc chip.  The C compiler generates code
>which avoids double rounding.
 
The last Sun compilers for the Sun-386i (C 1.0 and Fortran 1.3.1) also generate
code with double rounding when -f80387 is specified, as do the last Sun
compilers for the Sun-3 (C 1.1 and Fortran 1.4) with -f68881.  

However, although a 64-significant-bit extended precision format was
specified in SPARC V7, and a 113-significant-bit (quadruple) extended
precision format is specified in SPARC V8, neither has been implemented in
hardware, and no SPARC compilers generate those instructions.  The
implemented SPARC opcodes are single or double precision and thus cause no
double roundings, except when traditional C uses double-precision expression
evaluation for single-precision variables, which may be defeated on Suns
with -fsingle.

-------------------------------------------------------------------------

The main point to be drawn from the na-net discussion, which is also one
I tried unsuccessfully to convey to X3J11 and X3J3, is that the common
environmental inquiries are rather less precisely defined than one would
like, a shortcoming offset only by the fact that they are also much less
general than one would like.

IEEE 754 recommends provision of a nextafter(x,y) function, which tells
you what you really want to know about the set of representable numbers.
After several years of experience implementing that in software, I'm
convinced it's not worth the trouble; I prefer instead next(x) or
nextpositive(x), which is equivalent to nextafter(x,+infinity), is more
often what I want anyway, and is probably much easier to implement in
hardware.  For another point of view, ask Kahan about solving a nonlinear
equation; he likes nextafter(x,y) for that purpose, but I think a software
implementation of nextafter based on a hardware nextpositive would be just
as efficient for his purposes and more efficient for most, especially since
it's easier to implement in a RISC environment.
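
As an illustration (this sketch and the bit-twiddling approach are mine,
not part of any proposal), nextpositive for finite nonzero IEEE 754 doubles
can be written in C by exploiting the fact that, for non-negative doubles,
the ordering of values matches the ordering of their bit patterns, which is
also why the operation would be cheap in hardware; zeros, infinities and
NaNs are glossed over here:

    #include <stdint.h>
    #include <string.h>

    /* nextpositive(x) = nextafter(x, +infinity) for finite nonzero x. */
    double nextpositive(double x)
    {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);   /* view the IEEE 754 bit pattern */
        if (x > 0.0)
            bits += 1;                    /* one ulp toward +infinity */
        else
            bits -= 1;                    /* negative x: magnitude shrinks */
        memcpy(&x, &bits, sizeof x);
        return x;
    }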

----------------------------------------------------------------------------

From: Nick Higham <mbbgsnhacms.manchester-computing-centre.ac.uk>
Date: Sat Apr 13 14:31:37 PDT 1991
Subject: Three Measures of Precision in Floating Point Arithmetic

This note is about three quantities that relate to the precision of
floating point arithmetic.  For t-digit, rounded base b arithmetic the
quantities are

(1) machine epsilon (eps), defined as the distance from 1.0 to
    the smallest floating point number bigger than 1.0 (and
    given by eps = b**(1-t), which is the spacing of the
    floating point numbers between 1.0 and b),

(2) mu = smallest floating point number x such that fl(1 + x) > 1, and

(3) unit roundoff u = b**(1-t)/2 (which is a bound for the relative
    error in rounding a real number to floating point form).
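
(For concreteness, and as an aside not in Nick's note: in IEEE 754 double
precision arithmetic b = 2 and t = 53, so definitions (1) and (3) give the
values printed below; C's DBL_EPSILON corresponds to eps in definition (1).)

    #include <stdio.h>
    #include <math.h>
    #include <float.h>

    int main(void)
    {
        double eps = ldexp(1.0, 1 - 53);   /* b^(1-t) = 2^-52 */
        double u   = eps / 2.0;            /* b^(1-t)/2 = 2^-53 */

        printf("eps = %.17g  (DBL_EPSILON = %.17g)\n", eps, DBL_EPSILON);
        printf("u   = %.17g\n", u);
        return 0;
    }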

The terminology I have used is not an accepted standard; for example,
the name machine epsilon is sometimes given to the quantity in (2).
My definition of unit roundoff is as in Golub and Van Loan's book
`Matrix Computations' [1] and is widely used.  I chose the notation eps
in (1) because it conforms with MATLAB, in which the permanent
variable eps is the machine epsilon.  [Ed. note: Well, not quite. 
See my comments below.  --Cleve]

The purpose of this note is to point out that it is not necessarily the
case that mu = eps, or that mu = u, as is sometimes claimed in the
literature, and that, moreover, the precise value of mu is difficult to
predict.

It is helpful to consider binary arithmetic with t = 3.  Using binary
notation we have

         1 + u = 1.00 + .001 = 1.001,

which is exactly half way between the adjacent floating point numbers
1.00 and 1.01.  Thus fl(1 + u) = 1.01 if we round away from zero when
there is a tie, while fl(1 + u) = 1.00 if we round to an even last
digit on a tie.  It follows that mu <= u with round away from zero (and
it is easy to see that mu = u), whereas mu > u for round to even.
I believe that round away from zero used to be the more common choice
in computer arithmetic, and this may explain why some authors define
or characterize u as in (2).  However, the widely used IEEE standard
754 binary arithmetic uses round to even.

So far, then, it is clear that the way in which ties are resolved
in rounding affects the value of mu.  Let us now try to determine the
value of mu with round to even.  A little thought may lead one to
suspect that mu <= u(1+eps).  For in the b = 2, t = 3 case we have

          x = u*(1+eps) = .001*(1+.01) = .00101

     =>   fl(1 + x) = fl( 1.00101 ) = 1.01,

assuming ``perfect rounding''.    I reasoned this way, and decided to
check this putative value of mu in 386-MATLAB on my PC.
MATLAB uses IEEE standard 754 binary arithmetic, which has t = 53
(taking into account the implicit leading bit of 1).  Here is what I
found:

     >> format compact; format hex
     >> x = 2^(-53)*(1+2^(-52)); y = [1+x 1 x]
     y =
        3ff0000000000000   3ff0000000000000   3ca0000000000001

     >> x = 2^(-53)*(1+2^(-11)); y = [1+x 1 x]
     y =
        3ff0000000000000   3ff0000000000000   3ca0020000000000

     >> x = 2^(-53)*(1+2^(-10)); y = [1+x 1 x]
     y =
        3ff0000000000001   3ff0000000000000   3ca0040000000000

Thus the guess is wrong, and it appears that mu = u*(1+2^(42)*eps)
in this environment!   What is the explanation?

The answer is that we are seeing the effect of ``double-rounding'', a
phenomenon that I learned about from an article by Cleve Moler [2].
The Intel floating-point chips used on PCs implement internally the
optional extended precision arithmetic described in the IEEE standard,
with 64 bits in the mantissa [3].  What appears to be happening in the
example above is that `1+x' is first rounded to 64 bits; if
x = u*(1+2^(-i)) and i > 10 then the least significant bit is lost in
this rounding.  The extended precision number is now rounded to 53-bit
precision; but when i > 10 there is a rounding tie (since we have
lost the original least significant bit) which is resolved to 1.0,
which has an even last bit.
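
To make the two roundings concrete, here is a small C simulation (mine,
not part of the original note): fractions near 1 are carried as exact
integer multiples of 2^-64, and round-to-nearest-even is applied by hand,
first at the 64-bit extended spacing (2^-63 near 1) and then at the 53-bit
double spacing (2^-52).  A single direct rounding of 1 + u*(1 + 2^-11)
lands above 1, while the two-step rounding collapses to exactly 1,
reproducing the behaviour seen above for i = 11.

    #include <stdio.h>
    #include <stdint.h>

    /* Round f to the nearest multiple of step, ties to an even multiple. */
    static uint64_t round_even(uint64_t f, uint64_t step)
    {
        uint64_t lo  = f - f % step;            /* nearest multiple below */
        uint64_t rem = f - lo;
        if (rem * 2 < step) return lo;
        if (rem * 2 > step) return lo + step;
        return ((lo / step) % 2 == 0) ? lo : lo + step;   /* break the tie */
    }

    int main(void)
    {
        /* f is the fraction of 1 + u*(1 + 2^-11) = 1 + 2^-53 + 2^-64,
           expressed in units of 2^-64. */
        uint64_t f   = (1ULL << 11) + 1;
        uint64_t ext = 1ULL << 1;    /* spacing 2^-63: 64-bit significand */
        uint64_t dbl = 1ULL << 12;   /* spacing 2^-52: 53-bit significand */

        uint64_t once  = round_even(f, dbl);                   /* one rounding  */
        uint64_t twice = round_even(round_even(f, ext), dbl);  /* two roundings */

        printf("single rounding: 1 + %llu*2^-64\n", (unsigned long long) once);
        printf("double rounding: 1 + %llu*2^-64\n", (unsigned long long) twice);
        return 0;
    }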

The interesting fact, then, is that the value of mu can vary even
between machines that implement IEEE standard arithmetic.

Finally, I'd like to stress an important point that I learned from the
work of Vel Kahan: the relative error in addition and subtraction
is not necessarily bounded by u.  Indeed, on machines such as Crays
that lack a guard digit, this relative error can be as large as 1.  For
example, if b = 2 and t = 3, then subtracting from 1.0 the next
smaller floating point number, we have

  Exactly:              1.00-
                         .111
                        -----
                         .001


  Computed, without     1.00-
  a guard digit:         .11      The least significant bit is dropped.
                        -----
                         .01

The computed answer is too big by a factor of 2 and so has relative
error 1!  According to Vel Kahan, the example I have given mimics what
happens on a Cray X-MP or Y-MP, but the Cray 2 behaves differently and
produces the answer zero.  Although the relative error in
addition/subtraction is not bounded by the unit roundoff u for
machines without a guard digit, it is nevertheless true that

            fl(a + b) = a(1+e) + b(1+f),

where e and f are bounded in magnitude by u.
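
As a toy illustration (mine, not from the note, and not a model of any
particular machine's adder), the b = 2, t = 3 subtraction above can be
replayed in exact integer arithmetic, with every value expressed in units
of 2^-3:

    #include <stdio.h>

    int main(void)
    {
        int a = 8;                       /* 1.00  */
        int b = 7;                       /* 0.111 */

        int exact = a - b;               /* 0.001 = 1 unit */

        /* Without a guard digit, aligning .111 to the exponent of 1.00
           in a 3-digit register drops the trailing bit: .111 -> .11  */
        int b_aligned = (b >> 1) << 1;   /* 6 units */
        int no_guard  = a - b_aligned;   /* 1.00 - 0.11 = 0.01 = 2 units */

        printf("exact    = %d  (units of 2^-3)\n", exact);
        printf("no guard = %d  (units of 2^-3, relative error 1)\n", no_guard);
        return 0;
    }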

[1]  G. H. Golub and C. F. Van Loan, Matrix Computations, Second Edition,
Johns Hopkins Press, Baltimore, 1989.

[2]  C. B. Moler, Technical note: Double-rounding and implications for
numeric computations, The MathWorks Newsletter, Vol 4, No. 1 (1990), p. 6.

[3]  R. Startz, 8087/80287/80387 for the IBM PC & Compatibles, Third
Edition, Brady, New York, 1988.

----------------------------------------------------------------------------

Editor's addendum [Cleve Moler]:

I agree with everything Nick has to say, and have a few more comments.

MATLAB on a PC has IEEE floating point with extended precision implemented
in an Intel chip.  The C compiler generates code with double rounding.
MATLAB on a Sun Sparc also has IEEE floating point with extended precision,
but it is implemented in a Sparc chip.  The C compiler generates code
which avoids double rounding.

On both the PC and the Sparc

    eps = 2^(-52) = 3cb0000000000000 = 2.220446049250313e-16

However, on the PC

    mu = 2^(-53)*(1+2^(-10)) = 3ca0040000000000 = 1.111307226797642e-16

While on the Sparc

    mu = 2^(-53)*(1+2^(-52)) = 3ca0000000000001 = 1.110223024625157e-16

Note that mu is not 2 raised to a negative integer power.

MATLAB on a VAX usually uses "D" floating point (there is also a "G"
version under VMS).  Compared to IEEE floating point, the D format has
3 more bits in the fraction and 3 fewer bits in the exponent.  So
eps should be 2^(-55), but MATLAB says eps is 2^(-56).  It is actually
using the 1+x > 1 trick to compute what we're now calling mu.  There
is no extended precision or double rounding and ties between two floating
point values are chopped, so we can find mu by just trying powers of 2.
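
For reference, a bare-bones C version of that trick (my sketch, not
MATLAB's actual code) looks like the following.  It probes only powers of
two, and the volatile store is an attempt to force each sum into the
working double format rather than leaving it in extended registers; on an
IEEE 754 machine with round to even it therefore reports 2^(-52), i.e.
eps rather than the true mu, while on the VAX D format it should report
the 2^(-56) value given below.

    #include <stdio.h>

    int main(void)
    {
        volatile double sum;
        double x = 1.0;

        /* Halve x until 1 + x/2 rounds back to 1; the surviving x is
           the smallest power of two with fl(1 + x) > 1. */
        for (;;) {
            sum = 1.0 + x / 2.0;
            if (sum <= 1.0)
                break;
            x /= 2.0;
        }
        printf("mu (powers of two only) = %g\n", x);
        return 0;
    }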

On the VAX with D float

    eps = 2^(-55) =  2.775557561562891e-17

    mu =  2^(-56) =  1.387778780781446e-17

The definition of "eps" as the distance from 1.0 to the next floating
point number is a purely "geometric" quantity depending only on the
structure of the floating point numbers.  The point Nick is making is
that the more common definition of what we here call mu involves a
comparison between 1.0 + x and 1.0 and subtle rounding properties of
floating point addition.  I now much prefer the simple geometric
definition, even though I've been as responsible as anybody for the
popularity of the definition involving addition.
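
In code, the geometric definition is a one-liner; this sketch (mine, and
assuming a C library that provides nextafter) depends only on the set of
representable numbers, and the final subtraction is exact, so no rounding
behaviour enters at all:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* eps: the spacing between 1.0 and the next representable double. */
        double eps = nextafter(1.0, 2.0) - 1.0;
        printf("eps = %.17g\n", eps);   /* 2^-52 on IEEE 754 doubles */
        return 0;
    }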


