Comments on Sun's Proposal for Extension of Java Floating Point in JDK 1.2

Joseph D. Darcy  darcy@cs.berkeley.edu
Wed Aug 12 01:23:41 PDT 1998


PEJFPS = Proposal for Extension of Java(TM) Floating Point Semantics,
Revision 1

> Tim Lindholm's response to Samuel Figueroa's comments on PEJFPS

[snip]

> As for fused multiply-add, our reading of the proposal and our intent
> with it is to allow the PPC to use its fused multiply-add in implementing
> widefp.  If that proves to be false for some reason we haven't understood,
> we would want to fix that.

The only explicit mention of fused mac in PEJFPS is on page 30: "For
instance CPUs with fused multiply and add instructions may be unable
to use those instructions when implementing FP-strict operations."  It
would be possible to obliquely interpret PEJFPS sections 15.16 and
15.17.2 as allowing fused mac, but it is not clear from the text.
PEJFPS proposes allowing the programmer to inquire whether or not
extended formats are in use by testing various variables holding the
minimum/maximum exponent of the formats used in FP-wide contexts.
This mechanism does not work well with indicating whether fused mac is
being used since the extra precision and range of fused mac is
internal to the instruction.  To indicate more directly that fused
mac may be used, a new "boolean" system property called, say, fp.mac
could be added to Java.  (System properties are used to indicate
platform-dependent details, such as the proper line termination
character sequence; system properties are accessed with the
System.getProperties method.)  Additionally, if fused mac were to be
allowed, the conditions under which a successive add and mul can be
fused should be described.  In other words, the operations that could
possibly be fused should be inferable from the source code and not
entirely dependent on the cleverness of the optimizer.
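As a sketch of how such a property might be consulted (the fp.mac name
is only this message's suggestion, not an existing property on any VM):

```java
// Sketch only: "fp.mac" is the hypothetical property proposed above;
// no existing VM defines it.  Existing properties, such as
// line.separator, are read the same way.
public class FusedMacCheck {
    public static void main(String[] args) {
        // Unset on today's VMs, so the supplied default is returned.
        String raw = System.getProperty("fp.mac", "false");
        boolean fusedMac = Boolean.valueOf(raw).booleanValue();
        System.out.println("fused mac available: " + fusedMac);
    }
}
```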

[d. SPARCs as model IEEE 754 compliant processors]

> All seems pretty true.  To the extent (d) is true, it's not as
> nepotistic as it might appear in retrospect.  At the time this was
> decided, Java was Oak and the widely cross-platform Internet play that
> led to Java hadn't been conceived of yet.  Oak only ran on SPARC and
> using IEEE 754 was the natural choice there.  New processors were
> mostly implementing IEEE 754.

The x86 line of processors conforms to IEEE 754.  Unlike SPARC, the
x86 line has hardware support for the IEEE 754 recommended double
extended format.  The x86's policy of rounding only the significand,
not the exponent, when rounding to double precision is itself
sanctioned by IEEE 754 (section 4.3).  Since the floating point of the
8087 formed the basis for IEEE 754, it would be odd if the x86
processors did not conform to the standard.
the standard.  The SPARCs I'm familiar with have very poor support for
IEEE 754 mandated subnormals; a single operation on subnormal numbers
can take 10,000 cycles (about 3,333 times longer latency than an
operation on normal operands).
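The subnormals in question arise from ordinary Java arithmetic; a
minimal sketch (the halving below is exact in IEEE 754 double):

```java
// Halving the smallest normal double produces a subnormal, not zero:
// IEEE 754 underflow is gradual.  On SPARC implementations that trap
// to software on subnormal operands, operations on values like these
// are the ones that can run orders of magnitude slower.
public class Subnormals {
    public static void main(String[] args) {
        double minNormal = Double.longBitsToDouble(0x0010000000000000L); // 2^-1022
        double sub = minNormal / 2.0;   // 2^-1023, a subnormal
        System.out.println(sub > 0.0);  // true: no flush to zero
        System.out.println(Double.MIN_VALUE); // smallest subnormal, 2^-1074
    }
}
```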

SPARCs do implement rounding modes and sticky flags, *required* features
of IEEE 754 forbidden or omitted by Java (The Java(TM) Language
Specification, section 4.2.4).

Since Java is no longer Oak (and hasn't been Oak for several years),
it is difficult to justify Java's floating point semantics by citing
constraints inherited from Oak.

[snip]

> But at least predictability is traded off explicitly in favor of
> performance.

In PEJFPS, existing code implicitly has FP-wide semantics by default
(perhaps contrary to the intention of the code's author).

[tighter expression evaluation rules for FP-wide code.]

> Unfortunately, doing this is constraining to implementors on Intel.
> For instance, compiler writers need to spill to memory, and don't want
> to spill all 80 bits of intermediate results.  My assumption has always
> been that FSTP to 80 bits in memory is very slow, but we should know
> that before we write this off.

From the "Pentium(R) Processor Family Developer's Manual: Volume 3:
Architecture and Programming Manual," assuming everything is in the
cache, storing a 32 or 64 bit floating point value takes 2 cycles and
storing an 80 bit value takes 3 cycles; loading a 32 or 64 bit value
takes 1 cycle and loading an 80 bit value takes 3 cycles.  I assume
the timings for the PPro and subsequent processors are similar.
Generally, the newer Intel processors have better floating point
pipelining than previous ones.

>  We've been pushed very hard in favor of performance at the cost of
> predictability.  Note also that making widefp code predictable in
> this way seems to throw off other things, like the ability to use
> PPC's fused multiply-add.

On existing relevant architectures, the one with double extended (x86)
and the ones with fused mac (PowerPC, et al.) are disjoint.
Therefore, having somewhat separate rules for using extended precision
and fused mac in widefp contexts may be workable.  (Unless Merced has
both features???)

[snip]

> The (proposed) JVM Spec would always treat implicitly widefp methods as
> widefp, although the current proposal permits a lot of rounding in a
> widefp method.  The model should never be that widefp sometimes means
> strictfp and sometimes widefp.

While that may be the intention, it is not mandated by the proposal.

>  What widefp means is chosen on startup.  However, it is the case
> that sometimes widefp does rounding for its own purposes that is
> like strictfp.

[why not allow double rounding on gradual underflow in strict mode?]

> We stewed over this but ended up thinking that we had to stay
> backward compatible when strict.  We didn't think this small
> extension would satisfy performance needs (given the store-load
> costs) but would break our promises and still wouldn't give
> bit-for-bit.

Has anyone actually complained about exact reproducibility being
violated by the double rounding on underflow discrepancy exhibited by
existing Java VMs on the x86?
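For readers unfamiliar with the phenomenon, here is a toy model of it
(rounding to coarse fixed-point grids standing in for the real
extended/double formats), showing how rounding twice can disagree with
rounding once:

```java
// Toy illustration of double rounding: rounding 0.10100001 (binary)
// first to a fine grid and then to a coarse grid gives a different
// answer than rounding directly to the coarse grid.  All values below
// are exact in IEEE 754 double, so the discrepancy is genuine.
public class DoubleRounding {
    // Round v to the nearest multiple of ulp, ties to even multiples.
    public static double roundToNearestEven(double v, double ulp) {
        double q = v / ulp;               // exact: ulp is a power of two
        double floor = Math.floor(q);
        double frac = q - floor;
        if (frac > 0.5 || (frac == 0.5 && ((long) floor % 2 != 0))) {
            floor += 1.0;
        }
        return floor * ulp;
    }

    public static void main(String[] args) {
        double v = 0.62890625;            // 0.10100001 in binary
        double direct  = roundToNearestEven(v, 0.25);     // coarse grid
        double step1   = roundToNearestEven(v, 0.0625);   // fine grid first
        double twoStep = roundToNearestEven(step1, 0.25); // then coarse
        System.out.println(direct);   // 0.75
        System.out.println(twoStep);  // 0.5
    }
}
```

The first rounding lands exactly on a tie of the coarse grid, and the
tie-to-even rule then rounds the wrong way relative to the original value.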

Do Java 1.2's transcendental functions conform to the language
specification instead of using whatever transcendental functions an
underlying C compiler provides?

Sun's Java 1.1 JDK uses the transcendental functions of the underlying
C compiler.  On the x86, C compilers often use the corresponding
hardware instructions on the chip.  These instructions can give
different answers than the FDLIBM algorithms mandated by Java (The
Java(TM) Language Specification, section 20.11).  Moreover, the
transcendental functions on different x86 chips have been implemented
using at least 5 distinct sets of algorithms; therefore, the same
instruction with the same argument on different x86 chips can also
give different answers.  On platforms other than the x86, FDLIBM or
some other implementation of the transcendental functions may be used.
The discrepancy between these different versions of the transcendental
functions is erroneously ascribed to "rounding differences" in a Java
1.1.x bug report (bug ID 4058551).
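One way to make such discrepancies visible (a sketch, not part of any
official test suite) is to print the exact bits a VM computes for a
stressing argument; VMs faithfully using the mandated FDLIBM algorithms
must agree bit for bit:

```java
// Fingerprinting a VM's transcendental functions: a large argument
// stresses argument (range) reduction, where hardware instructions and
// different algorithm generations diverge most visibly.  Comparing the
// printed bit pattern across two VMs reveals any discrepancy exactly.
public class SinFingerprint {
    public static void main(String[] args) {
        double x = 1.0e10;
        double s = Math.sin(x);
        System.out.println(Long.toHexString(Double.doubleToLongBits(s)));
    }
}
```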

If Sun is committed to providing exactly-reproducible Java floating
point, is a very visible and acknowledged Java-standard-violating
library bug going to be fixed before the rarely encountered
perhaps-never-complained-about double rounding on underflow
situation?

[add a new data type to access extended precision where available]

> We considered this but didn't think we could do it at the time, or while
> retaining backward compatibility.  Licensees didn't want people to have to
> write new code to take advantage of the increased performance.

Adding a new type would improve backwards compatibility by better
preserving the semantics of existing code.  As discussed in previous
messages, Java's default strict semantics could be somewhat loosened
(allowing extended exponent range for anonymous values) in a way that
improves performance and largely retains reproducibility.

In general, some Java features have not been backwards compatible.
Obviously, APIs added in Java 1.n are not available in Java 1.(n-1).
However, the situation is more fundamental than that.  For example, a
Java 1.0 VM cannot provide reflection even if given the classes.zip
file from Java 1.1.  Reflection (and some other Java 1.0 -> Java 1.1
changes) require JVM changes.  Therefore, any code that uses such
features is not backwards compatible.  Allowing double rounding on
underflow is likely to be much less visible a change than these other
incompatibilities.
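For instance, even this minimal use of the Java 1.1 reflection API
cannot work on a Java 1.0 VM with Java 1.1's classes.zip on the class
path, because Method.invoke requires support inside the VM itself:

```java
// Minimal Java 1.1 reflection: look up String.length and invoke it
// dynamically.  The classes come from the class library, but invoke
// needs VM-level cooperation that no Java 1.0 VM provides.
import java.lang.reflect.Method;

public class ReflectDemo {
    public static void main(String[] args) throws Exception {
        Class c = Class.forName("java.lang.String");
        Method m = c.getMethod("length", new Class[0]);
        Object len = m.invoke("hello", new Object[0]);
        System.out.println(len);  // 5
    }
}
```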

-Joe Darcy
darcy@cs.berkeley.edu


