Square root and double rounding

Sun Jun 25 11:36:01 PDT 1995

Samuel Figueroa asks whether the square root of a p-bit floating point
number, rounded first to 2p+1 bits, then to p bits, must deliver the
same result as rounding the square root directly to p bits, provided
all rounding is performed according to the IEEE 754 default round-to-
nearest-even specification.  I believe the answer is no, in general.
More specifically, assuming all rounding is performed according to the
IEEE 754 round-to-nearest-even rule, I claim:

Proposition: Let a be a binary floating point number with p significant
bits, where p >= 2.  Compute s1 = sqrt(a) rounded to q significant bits
and s2 = (s1) rounded to p significant bits.  Then s2 is the same as
sqrt(a) rounded to p significant bits provided q >= 2p+2; if q = 2p+1,
then s2 differs from sqrt(a) rounded to p significant bits if and only
if a is of the form 2^2k * (1 - 2^-p) for some integer k.

Idea of proof: Double-rounding can foil correct rounding only when the
first rounding hits a "halfway case", i.e., a number exactly halfway
between two p-bit numbers, so that the round-ties-to-even rule rounds
the intermediate result in the same direction (up or down) as the ini-
tial rounding yielding a result with an error larger than half a unit
in the last place.  Thus we consider how close the square root of a
p-bit number can come to such a halfway case.  In particular, suppose
b is such a halfway number, scaled by a power of two to be an odd int-
eger.  Let a be the p-bit number whose square root b approximates.  If
|sqrt(a) - b| >= 2^-p, then when the square root is initially rounded
to any precision carrying at least 2p+1 bits, it will not hit the half-
way case b, and the second rounding will give the correct p-bit result.
Therefore we show that |sqrt(a) - b| < 2^-p only in two cases; in one,
the round-ties-to-even rule actually rounds the intermediate result
in the correct direction, but in the other, double-rounding gives the
wrong result.  (In either case, rounding initially to at least 2p+2
bits would resolve the tie and yield the correct result.)

Proof: Without loss, assume 2^2p <= a < 2^{2p+2}.  Let b be any odd
integer satisfying 2^p < b < 2^{p+1}.

Lemma 1: If |sqrt(a) - b| < 2^-p then |a - b^2| < 4.

Proof of Lemma 1: |sqrt(a) - b| < 2^-p  =>

	b - 2^-p < sqrt(a) < b + 2^-p  =>

	-2^{1-p}b - 2^{-2p} < a - b^2 < 2^{1-p}b - 2^{-2p}.

Now b <= 2^{p+1} - 1 so we have

	-4 < -4 + 2^{1-p} - 2^{-2p} <= -2^{1-p}b - 2^{-2p},  and

	2^{1-p}b - 2^{-2p} <= 4 - 2^{1-p} - 2^{-2p} < 4

and therefore -4 < a - b^2 < 4, or |a - b^2| < 4 as claimed.

Lemma 2: |a - b^2| < 4 if and only if a = b^2 - 1 and either b =
2^p + 1 or b = 2^{p+1} - 1.

Proof of Lemma 2: Since a is a p-bit number and a >= 2^2p, a is an int-
eger multiple of 2^{p+1}.  As p >= 2, a is a multiple of 8, so a - b^2
is congruent to -b^2 mod 8, and since b is an odd integer, the latter is
congruent to -1 mod 8.  Therefore |a - b^2| < 4  <=>  a - b^2 = -1  <=>
a = b^2 - 1 = (b - 1)(b + 1).  Now since a is a multiple of 2^{p+1} and
one of (b - 1) or (b + 1) is an odd multiple of 2, the other must be a
multiple of 2^p.  Given that 2^p < b < 2^{p+1}, this can only happen
when b = 2^p + 1 or 2^{p+1} - 1, as claimed.

Completion of proof of Proposition: By Lemma 2, unless either a =
2^{2p} + 2^{p+1} or a = 2^{2p+2} - 2^{p+2}, we have |a - b^2| >= 4,
so by Lemma 1, |sqrt(a) - b| >= 2^{-p}, and thus if sqrt(a) is rounded
initially to at least 2p+1 bits, it will not round to b; since b was
arbitrary, this means the initial rounding of sqrt(a) will not hit any
halfway case, so the second rounding to p bits will produce the correct
result.  Now we need only consider the two remaining cases.

The case a = 2^{2p} + 2^{p+1} is equivalent by scaling to the case
a = 1 + 2^{1-p} (one plus one ulp).  From the Taylor series sqrt(1 + x)
= 1 + x/2 - x^2/8 + x^3/16 . . ., we see that sqrt(a) < 1 + 2^-p, so if
s1 = sqrt(a) rounded to q bits, then s1 <= 1 + 2^-p for any q > p.  The
round-to-nearest-even rule then guarantees that s1 rounded to p bits
will yield s2 = 1.  But in this case, sqrt(a) rounded correctly to p bits
is also 1, so we will obtain the correct result.

The case a = 2^{2p+2} - 2^{p+2} is equivalent by scaling to the case a =
1 - 2^-p (one minus half an ulp).  In this case, sqrt(a) < 1 - 2^{-p-1}
- 2^{-2p-3}, so if s1 = sqrt(a) is rounded to at least 2p+2 bits, then
s1 <= 1 - 2^{-p-1} - 2^{-2p-2}, and thus s2 = 1 - 2{-p-1}, the correctly
rounded result.  But if s1 = sqrt(a) rounded to exactly 2p+1 bits, then
s1 = 1 - 2^{-p-1}, and by the round-ties-to-even rule, s2 incorrectly
rounds up to 1.  This completes the proof.

Douglas M. Priest
SunSoft Floating Point and Performance Group
Sun Microsystems, Inc.
(but only speaking for myself)