Square root and double rounding

Douglas Priest uunet!Eng.Sun.COM!Douglas.Priest
Sun Jun 25 11:36:01 PDT 1995


Samuel Figueroa asks whether the square root of a p-bit floating point
number, rounded first to 2p+1 bits, then to p bits, must deliver the
same result as rounding the square root directly to p bits, provided
all rounding is performed according to the IEEE 754 default round-to-
nearest-even specification.  I believe the answer is no, in general.
More specifically, assuming all rounding is performed according to the
IEEE 754 round-to-nearest-even rule, I claim:

Proposition: Let a be a binary floating point number with p significant
bits, where p >= 2.  Compute s1 = sqrt(a) rounded to q significant bits
and s2 = (s1) rounded to p significant bits.  Then s2 is the same as
sqrt(a) rounded to p significant bits provided q >= 2p+2; if q = 2p+1,
then s2 differs from sqrt(a) rounded to p significant bits if and only
if a is of the form 2^2k * (1 - 2^-p) for some integer k.

Idea of proof: Double-rounding can foil correct rounding only when the
first rounding hits a "halfway case", i.e., a number exactly halfway
between two p-bit numbers, so that the round-ties-to-even rule rounds
the intermediate result in the same direction (up or down) as the ini-
tial rounding yielding a result with an error larger than half a unit
in the last place.  Thus we consider how close the square root of a
p-bit number can come to such a halfway case.  In particular, suppose
b is such a halfway number, scaled by a power of two to be an odd int-
eger.  Let a be the p-bit number whose square root b approximates.  If
|sqrt(a) - b| >= 2^-p, then when the square root is initially rounded
to any precision carrying at least 2p+1 bits, it will not hit the half-
way case b, and the second rounding will give the correct p-bit result.
Therefore we show that |sqrt(a) - b| < 2^-p only in two cases; in one,
the round-ties-to-even rule actually rounds the intermediate result
in the correct direction, but in the other, double-rounding gives the
wrong result.  (In either case, rounding initially to at least 2p+2
bits would resolve the tie and yield the correct result.)

Proof: Without loss, assume 2^2p <= a < 2^{2p+2}.  Let b be any odd
integer satisfying 2^p < b < 2^{p+1}.

Lemma 1: If |sqrt(a) - b| < 2^-p then |a - b^2| < 4.

Proof of Lemma 1: |sqrt(a) - b| < 2^-p  =>

	b - 2^-p < sqrt(a) < b + 2^-p  =>

	-2^{1-p}b - 2^{-2p} < a - b^2 < 2^{1-p}b - 2^{-2p}.

Now b <= 2^{p+1} - 1 so we have

	-4 < -4 + 2^{1-p} - 2^{-2p} <= -2^{1-p}b - 2^{-2p},  and

	2^{1-p}b - 2^{-2p} <= 4 - 2^{1-p} - 2^{-2p} < 4

and therefore -4 < a - b^2 < 4, or |a - b^2| < 4 as claimed.

Lemma 2: |a - b^2| < 4 if and only if a = b^2 - 1 and either b =
2^p + 1 or b = 2^{p+1} - 1.

Proof of Lemma 2: Since a is a p-bit number and a >= 2^2p, a is an int-
eger multiple of 2^{p+1}.  As p >= 2, a is a multiple of 8, so a - b^2
is congruent to -b^2 mod 8, and since b is an odd integer, the latter is
congruent to -1 mod 8.  Therefore |a - b^2| < 4  <=>  a - b^2 = -1  <=>
a = b^2 - 1 = (b - 1)(b + 1).  Now since a is a multiple of 2^{p+1} and
one of (b - 1) or (b + 1) is an odd multiple of 2, the other must be a
multiple of 2^p.  Given that 2^p < b < 2^{p+1}, this can only happen
when b = 2^p + 1 or 2^{p+1} - 1, as claimed.

Completion of proof of Proposition: By Lemma 2, unless either a =
2^{2p} + 2^{p+1} or a = 2^{2p+2} - 2^{p+2}, we have |a - b^2| >= 4,
so by Lemma 1, |sqrt(a) - b| >= 2^{-p}, and thus if sqrt(a) is rounded
initially to at least 2p+1 bits, it will not round to b; since b was
arbitrary, this means the initial rounding of sqrt(a) will not hit any
halfway case, so the second rounding to p bits will produce the correct
result.  Now we need only consider the two remaining cases.

The case a = 2^{2p} + 2^{p+1} is equivalent by scaling to the case
a = 1 + 2^{1-p} (one plus one ulp).  From the Taylor series sqrt(1 + x)
= 1 + x/2 - x^2/8 + x^3/16 . . ., we see that sqrt(a) < 1 + 2^-p, so if
s1 = sqrt(a) rounded to q bits, then s1 <= 1 + 2^-p for any q > p.  The
round-to-nearest-even rule then guarantees that s1 rounded to p bits
will yield s2 = 1.  But in this case, sqrt(a) rounded correctly to p bits
is also 1, so we will obtain the correct result.

The case a = 2^{2p+2} - 2^{p+2} is equivalent by scaling to the case a =
1 - 2^-p (one minus half an ulp).  In this case, sqrt(a) < 1 - 2^{-p-1}
- 2^{-2p-3}, so if s1 = sqrt(a) is rounded to at least 2p+2 bits, then
s1 <= 1 - 2^{-p-1} - 2^{-2p-2}, and thus s2 = 1 - 2{-p-1}, the correctly
rounded result.  But if s1 = sqrt(a) rounded to exactly 2p+1 bits, then
s1 = 1 - 2^{-p-1}, and by the round-ties-to-even rule, s2 incorrectly
rounds up to 1.  This completes the proof.

Douglas M. Priest
SunSoft Floating Point and Performance Group
Sun Microsystems, Inc.
(but only speaking for myself)



More information about the Numeric-interest mailing list