More on input: a harder problem may be easier to solve

Tim Peters uunet!ksr.com!tim
Tue Apr 2 22:46:22 PST 1991


This sketches an approach related to, but apparently both more useful
and simpler than, the ones mentioned before.  As always, I'm tossing
this out in the hopes that someone with time & interest can do something
useful with it; I can't make the time even to write a program to see
where this leads, let alone spell everything out in detail.

Backing off from the hitherto exclusive focus on nearest/even endcases,
what would we *really* want from an "extra precision" approach to input?
Getting the nearest/even endcases right is only a start -- we'd also
want to handle the other rounding modes correctly, and get the inexact
flag set correctly (something that, e.g., Clinger's algorithm can't do
when it multiplies by an approximation to a negative power of 10 -- and
I don't know about you, but for the sake of speed I *really* want to
avoid division when faking an extended precision).

It's pretty clear that we could have all this, without needing arbitrary
precision, if we had enough precision in the working result (WR) so that
the following held ("iff" means "if & only if"):

1) The WR has excess = 0 iff the infinitely precise result (IPR) has
   excess 0.

2) The WR has excess in (0,.5) ulp iff the IPR does.

3) The WR has excess = 0.5 ulp iff the IPR does.

4) The WR has excess in (.5,1) ulp iff the IPR does.

Notes:

  - The thrust of #1 is to enable setting the inexact flag correctly,
    and in the non-nearest/even rounding modes to prevent wrong
    roundings for numbers that are exactly representable, or close to
    it.

  - The thrust of #3 is to make double-rounding errors impossible under
    nearest/even rounding.

  - It's possible that #2-#4 could be weakened somewhat, perhaps
    allowing a smaller working precision, if the double-rounding cases
    that could then arise would all provably work out right (by
    accident) anyway.  I'm sure Jon will pursue that with vigor, but I
    don't care about it <grin>.

  - I realize the 8 conditions here (each "iff" yields 2 conditions) are
    not mutually independent.  It doesn't matter.

Under any non-insane approach, #2 and #4 will take care of themselves if
#1 and #3 are provided for, so I'll ignore #2 and #4.  Half of #1 and #3
are easy to get:  if a number is exactly representable or exactly at a
halfway point, and we get a working result good to within a half ulp
(wrt the working result's *extended* precision), the excess in the
working result must be 0 or 1/2 ulp (wrt the final desired precision)
respectively (else the working result would not be within a half ulp
of the IPR, and we said it was).

So the real problems here are providing for the "only if" halves of #1
and #3:  guaranteeing that the WR doesn't have an excess of exactly
0 or 1/2 ulp (wrt the final desired precision) unless the IPR does.

So, if the approach is non-insane, and the IPR does not have an excess
of 0 or 1/2 ulp, how can the WR wind up having an excess of 0 or 1/2
ulp?  The only way is via rounding to the working precision:  if the IPR
has an excess *close* (but not exactly equal) to 0 or 1/2 ulp, and the
working precision is too small to represent the difference between the
actual excess and 0 or 1/2 ulp, the best the working precision can do is
to approximate the true excess *by* an excess of exactly 0 or 1/2 ulp.
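
Here's a small Python sketch of that collapse, using exact rationals
(the 8-bit "final" precision matches the toy format used later on; the
18-bit working precision and the particular IPR are just cooked up to
trigger the problem):

    from fractions import Fraction

    def round_nearest_even(x, bits):
        # round a Fraction in [1, 2) to `bits` significant bits, nearest/even
        scaled = x * 2**(bits - 1)
        q = scaled.numerator // scaled.denominator
        rem = scaled - q
        if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and q % 2):
            q += 1
        return Fraction(q, 2**(bits - 1))

    def excess_in_ulps(x, bits):
        # how far x sits above the next-lower `bits`-bit value, in ulps
        scaled = x * 2**(bits - 1)
        return scaled - scaled.numerator // scaled.denominator

    FINAL = 8     # the final precision (same as the toy format below)
    WORK = 18     # a working precision too narrow for this particular input

    # a made-up IPR whose excess is a hair *above* 1/2 ulp of the final precision
    ipr = 1 + Fraction(78, 128) + Fraction(1, 256) + Fraction(1, 2**30)
    wr = round_nearest_even(ipr, WORK)

    print(excess_in_ulps(ipr, FINAL))  # 1/2 + 2^-23 ulp: should round the last bit up
    print(excess_in_ulps(wr, FINAL))   # exactly 1/2 ulp: now looks like a dead tie
    print(round_nearest_even(ipr, FINAL), round_nearest_even(wr, FINAL))
    # -> 207/128 103/64: rounding the WR instead of the IPR picks the wrong neighbor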

In other words (& it's probably more obvious from a picture!), the
troublesome IPR cases look like:

xxx | 10000000...0000001yyy  
xxx | 01111111...1111110yyy
xxx | 00000000...0000001yyy
xxx | 11111111...1111110yyy

where the "xxx" represent (almost) arbitrary bitstrings of length #-of-
bits-desired-in-the-end, the "yyy" represent (truly) arbitrary
bitstrings, the "|" is just a mark to visually separate the bits we want
to retain in the end from the excess bits in the IPR, and the ellipsis
means to repeat the bit to its left some arbitrary number of times.  The
first two patterns are troublesome for condition #3, and the last two
for #1.

If we shift our view of these troublesome cases "to the left" by one
bit, it promises to make life a lot easier:

xxx1 | 0000000...0000001yyy  
xxx0 | 1111111...1111110yyy
xxx0 | 0000000...0000001yyy
xxx1 | 1111111...1111110yyy

I.e., the potentially troublesome cases are all & only those that have a
long contiguous string of 0's or 1's starting *two* bit positions beyond
the end of the final desired precision.  If we can find the longest such
contiguous string possible, then a working precision that's good to
within a half ulp wrt a couple bits beyond that will never get confused
about whether it is or isn't exactly at a 0 or half ulp (wrt the final
desired precision) boundary (clearly, because the working precision is
then wide enough to "directly see" that it's not at such a boundary in
the worst possible case).
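
Measuring that longest run for any particular exact input is purely
mechanical.  A quick Python sketch (the helper name and the 200-bit cap
are arbitrary):

    from fractions import Fraction

    def run_after_view(sig, p, cap=200):
        # sig: the exact significand as a Fraction in [1, 2)
        # p:   bits of precision in the final format
        # Returns the length of the contiguous run of equal bits starting two
        # positions past the last retained bit (i.e. just past the shifted
        # view), capped at `cap` bits.
        rest = sig * 2**p                # integer part = the (p+1)-bit shifted view
        rest -= rest.numerator // rest.denominator
        if rest == 0:
            return cap                   # representable in p+1 bits: unbounded run
        bits = []
        for _ in range(cap):
            rest *= 2
            b = rest.numerator // rest.denominator
            bits.append(b)
            rest -= b
        run = 1
        while run < len(bits) and bits[run] == bits[0]:
            run += 1
        return run

    # e.g. the eventual worst case found below: 1.75e-4 with 8 bits of precision
    print(run_after_view(Fraction(175, 10**6) * 2**13, 8))    # -> 9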

As in earlier msgs, there *isn't* a worst case unless we agree to limit
the number of digits we'll look at.  Once we stick a limit on, there is
a worst case.  Finding the worst case proceeds very much as before,
except the black magic of "shifting the view left a bit" yields a
simpler functional form to minimize.  In the past, giving "general
hints" about this method seemed to create more confusion than
understanding, so this time I'll go to the opposite extreme and show all
the grunt work for a specific toy example; there is nothing here that
doesn't generalize straightforwardly to realistic examples.

So, suppose we're only looking at 3 decimal digits in the inputs, and
that our target is an IEEE-like format with only 8 bits of precision
(you're on your own for denorms: I'm assuming the exponent field is
unbounded <grin>).  Under these conditions, what's the "worst case" that
can map into a machine number with (unbiased) exponent -13?

Note:  to get *the* worst case you have to repeat this argument for each
possible binary exponent -- that's easy & fast if you write a program to
do it (although see later).

First we figure out which inputs can map into 2#1.xxxxxxx * 2^-13.  This
may actually depend in part on the final rounding mode, but we'll be a
little bizarre and try to cover them all at once (we won't get away with
it, though).  And this is the only place where the next-highest-
representable-float-is-twice-as-far-away-as-the-next-lowest-representable-
float nightmare (at powers of 2) crops up (so those who were worried
because it didn't show up explicitly in previous discussions can relax:
it plays a part, but a minor one).

So, under +oo rounding, anything just above 2#1.1111111 * 2^-14 already
maps into the 2#1.xxxxxxx*2^-13 form, and under -oo rounding, anything
just below 2#1.0000000 * 2^-12 still does; no input outside those two
bounds can map into that form under any rounding mode.  Converting
those bounds to decimal and doing the floor and ceiling bits in the
obvious ways, the inputs we're interested in are in the range 1.22e-4 ..
2.44e-4 (any 3-digit input smaller than 1.22e-4 is guaranteed to map into
something outside the desired range regardless of rounding mode, and
ditto for any 3-digit input larger than 2.44e-4).

Restating that a bit, the inputs of interest are of the form

(s * 10^-2) * 10^-4 = s * 10^-6

for integer s in [122,244].
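
That bracketing is easy to mechanize.  A quick Python sketch of it (P,
E and D are the toy parameters above; Q = -6 is picked by eye here so
that 3-digit values of s cover the range):

    from fractions import Fraction

    P, E, D = 8, -13, 3   # toy precision, binary exponent under study, decimal digits
    Q = -6                # the inputs of interest are s * 10**Q

    lo = Fraction(2)**E - Fraction(2)**(E - P)   # 2#1.1111111 * 2^-14, the +oo bound
    hi = Fraction(2)**(E + 1)                    # 2#1.0000000 * 2^-12, the -oo bound
    ten_q = Fraction(10)**Q

    slo = min(s for s in range(10**(D - 1), 10**D) if s * ten_q > lo)
    shi = max(s for s in range(10**(D - 1), 10**D) if s * ten_q < hi)
    print(slo, shi)       # -> 122 244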

Multiplying by 2^13/2^13 gives that the input is

(s * 10^-6 * 2^13) * 2^-13

We would *like* to say that the s*10^-6*2^13 part is in [1.,2.) now, but
alas it isn't at s=122.  That's what we get for trying to do it all at
once <grin>.  Seriously, s=122 is a special case:  put it off to the
side and don't forget to check it later (this is one of the endless
niggling details I didn't spell out in earlier msgs).

Now that we're limiting ourselves to s in [123,244], the s*10^-6*2^13
part is indeed always in [1.,2.), so the (infinitely precise)
significand of the input is s*10^-6*2^13 and the (unbiased) exponent is
-13.
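
A quick Python check of that claim (and of why s=122 got put off to the
side):

    from fractions import Fraction

    sig = lambda s: Fraction(s, 10**6) * 2**13
    print(float(sig(122)), float(sig(123)), float(sig(244)))  # -> 0.999424 1.007616 1.998848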

Now the first *nine* bits of the infinitely precise significand (we're
doing the "shift left a bit" trick described earlier) are just

int( (s*10^-6*2^13) * 2^8 ) =
int( s*10^-6*2^21 ) =
int( (s*2^21)/10^6 ) =
int( (s*2^15)/5^6 )

We don't care what those are, but we do care about the excess (the stuff
beyond the 9 bits), and the excess is exactly

(s*2^15 mod 5^6) / 5^6 ulp (wrt the 9th bit in the infinitely precise
		            significand)

and for a long string of 0's or 1's to start off the excess it's
necessary & sufficient that the numerator either be close (but not
exactly equal) to 0 (then we'll get a long string of 0's) or close (but
not exactly equal) to 5^6 (then we'll get a long string of 1's).  Hence
the whole problem boils down to finding the s in [123,244] that
minimizes (in the modular sense) the function

s*2^15 mod 5^6 =

s*1518 mod 15625
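
For a range this tiny you can simply try every s.  A brute-force Python
check (the convergent machinery below reaches the same answer with no
search at all):

    B, C = 1518, 15625                  # s*2^15 mod 5^6  ==  s*1518 mod 15625

    def mod_distance(s):
        r = s * B % C
        return min(r, C - r)            # close to 0 *or* close to 5^6 both qualify

    worst = min(range(123, 245), key=mod_distance)
    print(worst, worst * B % C)         # -> 175 25  (an excess of 25/15625 = 1/625 ulp)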

The payoff <grin> is that we do *not* have to deal with a non-zero
offset here as we did before (the offset before came from restricting
our view to the 1/2 ulp endcases -- by looking at more, we actually need
to solve less); the theory in Knuth Vol 2 Ed 2 Sec 4.5.3 exercise 43 is
directly applicable as-is.

Plowing thru that theory,

1518/15625 = / 10, 3, 2, 2, 3, 6, 4 /   (rather unusual, really!)
			    ^
			    could have stopped here (see below)

and the convergents are

 1 / 0
 0 / 1
 1 / 10
 3 / 31
 7 / 72
 17 / 175
 58 / 597     could have stopped here, since we're outside [123,244] now
 365 / 3757
 1518 / 15625

The largest convergent denominator that falls in [123,244] is 175, so
we're *guaranteed* that the "worst case" in this range is at s=175.  No
search was needed, just a handful of convergents (note that there's no
need to compute the numerators of the convergents either -- I just did
that above to make it clearer).
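
The convergent denominators fall straight out of the Euclidean
algorithm.  A short Python sketch that reproduces the table above and
picks 175 directly:

    def convergent_denominators(b, c):
        # denominators of the continued-fraction convergents of b/c, via Euclid
        k2, k1 = 1, 0                   # denominators of the 0/1 and 1/0 seed rows
        dens = []
        while c:
            a, b, c = b // c, c, b % c
            k2, k1 = k1, a * k1 + k2
            dens.append(k1)
        return dens

    dens = convergent_denominators(1518, 15625)
    print(dens)                                     # [1, 10, 31, 72, 175, 597, 3757, 15625]
    print(max(d for d in dens if 123 <= d <= 244))  # -> 175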

You can easily check (through exhaustion) that 1.75e-4 really is the
worst 3-digit input mapping into something of the form 2#1.xxxxxxx*2^-13;
in fact

1.75e-4 = 2#1.0110111100000000011010001101101... *2^-13
	            x	       y

where "x" marks the 8th (last retained) bit, and the worst case in this
range happens to be a little bit above a 1/2 ulp trouble spot.  Here y
marks the bit (the 11th excess bit) that finally distinguishes this from
a true 1/2 ulp case, and so obtaining a working result good to within
1/2 ulp wrt the 11th excess bit *suffices* to satisfy conditions #1 thru
#4 for *all* inputs that map into 2#1.xxxxxxx*2^-13.
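
If you'd rather not trust hand-done long division, here's a throwaway
Python sketch that reproduces that expansion (the 32-bit cutoff is
arbitrary):

    from fractions import Fraction

    sig = Fraction(175, 10**6) * 2**13   # significand of 1.75e-4 at exponent -13 (= 896/625)
    x, bits = sig, []
    for _ in range(32):                  # the first 32 significand bits are plenty here
        b = x.numerator // x.denominator
        bits.append(str(b))
        x = (x - b) * 2
    print("1." + "".join(bits[1:]))      # -> 1.0110111100000000011010001101101

    excess = bits[8:]                    # everything past the 8 retained bits
    y = next(i for i, bit in enumerate(excess[1:], start=2) if bit == "1")
    print(y)                             # -> 11: the "y" bit marked above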


Remaining problems:  We're left trying to minimize a function of the form
B*s mod C over s in [slo,shi].  If some convergent to B/C has a
denominator in [slo,shi] (as happened above, and this is efficiently
determinable for even huge B and C), then it's easy.  But if no
convergent has a denominator in range, it may not be so easy (a rough
sketch pulling these cases together follows these notes):

   E.g., if slo > C then a direct hit is impossible (the denominator of
   the last convergent is always C).  You can usually worm around that
   by using the fact that s+i*C is congruent to s for any integer i, so
   just start at the penultimate convergent and work backwards, until
   finding one whose denominator is a multiple of C away from being in
   [slo,shi].

   Note that a degenerate form of the problem comes up if you try to
   apply this to a binary range and number-of-decimal-input-digits such
   that all the relevant decimal inputs are exactly representable.
   E.g., all the 3-digit decimal inputs of the form x.xxe2 are exactly
   representable as IEEE doubles.  This isn't a problem except in that
   you need to recognize it when it happens.

   If the tricks above don't help, the theory in the proof of Knuth's
   theorem 6.4S can be used to get at the "fine structure" of the bad
   cases (in effect, it gives you several more choices for "s" to try,
   and with some luck you'll find one in [slo,shi] -- although it seems
   darned unlikely that *the* worst case over *all* the binary exponent
   ranges could come from anything other than a direct hit).

   If 6.4S doesn't help either, I still don't know a clean theory to
   find the worst case.  The "hill-climbing" strategy hinted at in
   earlier msgs is equally effective here as it was there, but the
   details involved in getting that strategy to work correctly would
   make all of this memo look tiny (i.e., I'm never going to be able to
   make time to describe it in the foreseeable future -- & I'm 98% sure
   there's an easier way to do it anyway).
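
Pulling those pieces together, here's a rough Python sketch of the
whole search for one binary-exponent range.  Only a direct convergent
hit carries the guarantee discussed above; the shifted and brute-force
branches are just the "worm around" fallbacks from the notes, the
cutoff constant is arbitrary, and it makes no attempt to recognize the
degenerate exactly-representable case:

    def convergent_denominators(b, c):
        # same helper as in the earlier sketch, repeated so this stands alone
        k2, k1 = 1, 0
        dens = []
        while c:
            a, b, c = b // c, c, b % c
            k2, k1 = k1, a * k1 + k2
            dens.append(k1)
        return dens

    def worst_s(B, C, slo, shi, brute_limit=10**6):
        # Look for the s in [slo, shi] minimizing min(B*s mod C, C - B*s mod C).
        # A convergent denominator that lands in the range directly is the
        # guaranteed worst case; one shifted by a multiple of C has the same
        # residue but no extremality guarantee; brute force is the last resort.
        dens = convergent_denominators(B, C)
        hits = [q for q in dens if slo <= q <= shi]
        if hits:
            return max(hits)
        for q in reversed(dens[:-1]):              # penultimate convergent and back
            shifted = q - ((q - slo) // C) * C     # smallest q + i*C that is >= slo
            if shifted <= shi:
                return shifted
        if shi - slo <= brute_limit:
            return min(range(slo, shi + 1),
                       key=lambda s: min(s * B % C, C - s * B % C))
        return None                                # the genuinely hard leftover case

    print(worst_s(1518, 15625, 123, 244))          # -> 175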

More "head work" would probably have a good payoff here:  ask your
friendly local number theorist to come up with a fast way to solve the
abstract minimization problem (and for God's sake don't tell them it has
an *application*! <grin -- but some number theorists take pride in being
divorced from the "real world">), and I'm sure there's a clean theory
hiding here waiting for them to find.  Or do it yourself, if you have
the time -- could be fun.

details-details-ly y'rs  - tim

Tim Peters   Kendall Square Research Corp
tim@ksr.com,  ksr!tim@harvard.harvard.edu


