[Cfp-interest 2108] supernormal numbers (was: WG14 IEEE 754-C binding meeting minutes) 2021/08/17

Vincent Lefevre vincent at vinc17.net
Wed Aug 18 08:25:43 PDT 2021


On 2021-08-17 14:02:43 -0500, Rajan Bhakta wrote:
>     Number classification and normal numbers (See CFP2091-3, CFP2096).
[...]
>       Fred: There is also supernormal (double double has it). Do you know
> if DBL_MAX + DBL_MAX is a finite number instead of an infinity.

Because the absolute value of the second component of a double-double
number must be less than or equal to 1/2 ulp of the first component,
DBL_MAX + DBL_MAX is an invalid representation (trap representation
in the C terminology).
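
As a rough illustration, the constraint can be checked on a pair of
doubles. This is only a sketch: the helper name dd_is_canonical is made
up here, and the test hi == RN(hi + lo) is a simplification of the
half-ulp rule (it assumes round-to-nearest and FLT_EVAL_METHOD == 0).

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

/* Hypothetical helper: accept a (hi, lo) pair when hi is finite and
   adding lo in double precision gives back hi, i.e. hi == RN(hi + lo).
   This approximates the rule |lo| <= 1/2 ulp(hi). */
static bool dd_is_canonical(double hi, double lo)
{
    volatile double sum = hi + lo;   /* volatile: force a double result */
    return hi >= -DBL_MAX && hi <= DBL_MAX && sum == hi;
}

/* The two pairs discussed in this message. */
static void dd_examples(void)
{
    /* DBL_MAX + DBL_MAX: the low part is far above 1/2 ulp(DBL_MAX). */
    assert(!dd_is_canonical(DBL_MAX, DBL_MAX));
    /* DBL_MAX + DBL_MAX * DBL_EPSILON / 4: low part below 1/2 ulp. */
    assert(dd_is_canonical(DBL_MAX, DBL_MAX * DBL_EPSILON / 4));
}
```

Calling dd_examples() from main should pass silently under the stated
assumptions.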

However, due to a representation issue with the maximum exponent,
the maximum representable finite floating-point number LDBL_MAX is
strictly larger than the maximum normalized floating-point number
(which is close to DBL_MAX/2).

With GCC on PowerPC (double-double), where the precision is 106 bits,

LDBL_MAX = 0x1.fffffffffffff7ffffffffffff8p+1023

which is
           0x1.fffffffffffffp+1023
         + 0x0.00000000000007ffffffffffff8p+1023

though I would rather expect DBL_MAX + DBL_MAX * DBL_EPSILON / 4
= 0x1.fffffffffffff7ffffffffffffcp+1023, i.e. with an additional
trailing 1. I don't see why this would be a trap representation.
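
The two candidate low parts can be compared using plain doubles. This
is a sketch only: the names lo_gcc and lo_expected are invented for the
example, and the hex literals are the low components above rewritten
with a leading 1.

```c
#include <assert.h>
#include <float.h>

/* Low component of GCC's LDBL_MAX: 52 one bits, 2^970 - 2^918. */
static const double lo_gcc      = 0x1.ffffffffffffep+969;
/* Low component of the expected value: 53 one bits, 2^970 - 2^917. */
static const double lo_expected = 0x1.fffffffffffffp+969;

static void ldbl_max_examples(void)
{
    /* The expected low part is exactly DBL_MAX * DBL_EPSILON / 4. */
    assert(lo_expected == DBL_MAX * DBL_EPSILON / 4);
    /* The two differ only by the additional trailing bit, 2^917. */
    assert(lo_expected - lo_gcc == 0x1p+917);
    /* Both are below 1/2 ulp(DBL_MAX) = 2^970. */
    assert(lo_gcc < 0x1p+970);
    assert(lo_expected < 0x1p+970);
}
```

All the operations involved are exact in binary64, so the assertions do
not depend on the rounding mode.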

However, the maximum normalized floating-point number is
0x1.ffffffffffffffffffffffffff8p+1022, or equivalently,
0x0.ffffffffffffffffffffffffffcp+1023, which is

           0x1.0000000000000p+1023
         - 0x0.000000000000000000000000004p+1023
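
The same decomposition can be written as a pair of doubles. This is a
sketch only: the names norm_hi and norm_lo are made up, and the last
check assumes round-to-nearest with FLT_EVAL_METHOD == 0.

```c
#include <assert.h>
#include <float.h>

/* Maximum normalized double-double from the text, as a (hi, lo) pair. */
static const double norm_hi = 0x1p+1023;
static const double norm_lo = -0x1p+917;   /* -2^(1023-106) */

static void norm_max_examples(void)
{
    /* hi is just above DBL_MAX / 2 = 2^1023 - 2^970. */
    assert(norm_hi - DBL_MAX / 2 == 0x1p+970);
    /* The value is well below DBL_MAX. */
    assert(norm_hi < DBL_MAX);
    /* |lo| is far below 1/2 ulp(hi), so hi + lo rounds back to hi. */
    volatile double sum = norm_hi + norm_lo;
    assert(sum == norm_hi);
}
```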

There's another issue:

  FLT_MAX_EXP
  DBL_MAX_EXP
  LDBL_MAX_EXP

are defined in the current C2x draft (N2596) as

  maximum integer such that FLT_RADIX raised to one less than that
  power is a representable finite floating-point number, e_max

while e_max was first introduced in the floating-point model, i.e.
for *normalized* numbers. A solution would be to introduce

  FLT_NORM_MAX_EXP
  DBL_NORM_MAX_EXP
  LDBL_NORM_MAX_EXP

which would follow the floating-point model, and

  FLT_MAX_EXP
  DBL_MAX_EXP
  LDBL_MAX_EXP

which would allow potentially larger values, possibly with a relaxed
definition (the current one is OK when there are no supernormal numbers,
but may be artificial otherwise).
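
To make the current definition concrete, here is a sketch that applies
it literally to plain double, where there are no supernormal numbers.
The function name max_exp_by_definition is made up for the example.

```c
#include <assert.h>
#include <float.h>

/* Largest integer e such that FLT_RADIX^(e-1) is a finite double,
   found by repeated multiplication. */
static int max_exp_by_definition(void)
{
    volatile double x = 1.0;   /* x == FLT_RADIX^(e-1) with e == 1 */
    int e = 1;
    for (;;) {
        volatile double next = x * FLT_RADIX;
        if (next > DBL_MAX)    /* overflowed to infinity */
            break;
        x = next;
        e++;
    }
    return e;
}

static void max_exp_examples(void)
{
    /* For double, the literal definition and the model agree. */
    assert(max_exp_by_definition() == DBL_MAX_EXP);
}
```

For double-double, if the figures above are read this way, the same
literal test would accept 2^1023 (it is representable), which gives
1024, whereas the maximum normalized number 0x0.ff...cp+1023 corresponds
to e_max = 1023 in the model. That gap is what a separate
LDBL_NORM_MAX_EXP would capture.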

FYI, my original GCC bug report (submitted before the defect report,
so beware that some of the discussion there is obsolete):

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61399

-- 
Vincent Lefèvre <vincent at vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

