Note on Java Numerics

Jerome Coonen jeromeabe.com
Mon Jan 27 17:02:46 PST 1997


====================================================================
A Note On Java Numerics                               Jan. 25, 1997
====================================================================

Overview

Java promises to be a vastly important programming language, even
in such arenas as hard-core numerical computation.  Its current design
is seriously flawed and detrimental to the numerical health of millions
of potential users.  It neither captures the spirit of the IEEE standard
for floating point arithmetic nor reflects the years of work of the
ANSI C standards group to undo the numerical shortcomings of C's
original, PDP-11-centric design.

Java's designers claim to aspire to bit-identical results across all
Java implementations.  But besides being mathematically intractable,
that goal is commercially infeasible.  Just as Gresham observed that
bad money drives out the good, so will faster computations -- differing
only by a bit or two of rounding error -- drive out the best computations.
Java's designers, if anyone, should understand the importance of benchmark
speed in computer sales.

Instead of convening a council to proclaim the one "right" answer to any
computation, Java's design team would serve the computing community better by
providing the most robust numerical environment possible across a diversity
of architectures.  Good tools will lead to good results sooner than
arbitrary decisions about right and wrong results.  Today, the Java
specification favors processors like Sun's Sparc, at the expense of
millions of users of other hardware.

In "The Java Language: A White Paper," Sun describes Java as a, "simple,
object-oriented, distributed, interpreted, robust, secure, architecture-
neutral, portable, high-performance, multithreaded, and dynamic language."

This note explores Java from a numerical standpoint, especially with
regard to its being architecture neutral, portable, and high performance.
It suggests that by using experience gained by the C standards group,
the Java design team could make Java a more truly open design.  There
is still time for Java to be numerically robust over a wide range of
processors.

Two Examples

Consider this simple piece of a larger computation:

double h, a, b, c, d;
h = (a * b) / (c * d);

Programmers familiar with the Pentium processor might expect expression
evaluation to proceed in the form

fld.d	a	; push a onto FPU stack
fmul.d	b	; a * b on stack
fld.d	c	; push c
fmul.d	d	; a * b and c * d on stack
fdivp		; a * b / c * d on stack
fstp	h	; pop and store as double

This is an excellent application of the floating point evaluation stack
in the x86 architecture.  And because stack values are stored to 11 more
significant bits and with substantially wider exponent range than 64-bit
double values, rounding error is reduced and the possibility of intermediate
overflow or underflow is eliminated.
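
To make the benefit of the extra range concrete, here is a small C
sketch, not part of the original example, with values chosen so that
both products overflow the double format even though the quotient is
an ordinary number.  It assumes a compiler that rounds double
expressions to double (as Java specifies) and a long double that is
genuinely wider, as on x86; where long double is no wider than double,
both results degrade alike.

#include <stdio.h>

int main(void)
{
    double a = 1e200, b = 1e200, c = 2e200, d = 1e200;

    /* Each product coerced to double overflows to infinity,
       so the quotient comes out NaN although the true value is 0.5. */
    double narrow = (a * b) / (c * d);

    /* Products held in a wider format: no intermediate overflow,
       and the quotient rounds to 0.5 as expected. */
    long double wide = ((long double)a * b) / ((long double)c * d);

    printf("narrow = %g\n", narrow);        /* nan */
    printf("wide   = %g\n", (double)wide);  /* 0.5 */
    return 0;
}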

A careful reading of "The Java Language Specification" by Gosling, Joy, and
Steele, however, reveals language suggesting that all intermediate results
should be stored to "double precision."  According to numerical aficionados
at Sun, the spec ought to say "double precision and range." This restriction
leads to this more cumbersome evaluation on some Pentium systems:

fld.d	a	; push a onto FPU stack
fmul.d	b	; a * b on stack
fstp.d	temp1	; coerce a * b to double -- EXTRA STORE
fld.d	c	; push c
fmul.d	d	; c * d on stack
fstp.d	temp2	; coerce c * d to double -- EXTRA STORE
fld.d	temp1	; reload a * b -- EXTRA LOAD
fdiv.d	temp2	; (a * b) / (c * d) on stack -- EXTRA LOAD
fstp	h	; pop and store as double

The cost of converting intermediate results to double in this simple
calculation is writing two double values to memory and then
immediately reading them back -- nearly doubling the memory traffic
and increasing the instruction count by half.  This is not what the designers
of the x86 FPU (or of the IEEE floating point standard) had in mind.

Here is an example on the PowerPC processor.  Numerical approximations
are often structured to have the form (base value) + (residual), where
the base value is a fast but rough approximation, in turn refined by the
smaller residual.  Such approximations lead to expressions of the form

double y, base, x, h;
y = base + (x * h);

PowerPC enthusiasts crave such expressions because they are handled so
well by a single instruction

; Assume floating registers fr1 = base, fr2 = x, and fr3 = h.
; Compute fr0 = y = base + (x * h)
fmadd	fr0,fr2,fr3,fr1	; fr0 = (fr2 * fr3) + fr1

The power of the "fused multiply-add" instructions lies in the evaluation
of the product (x * h) with no rounding error before this value -- to a full
106 significant bits -- is added to the value of base in fr1.
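
The claim that the product is formed exactly can be checked with the
fma() function C9X provides in <math.h>.  The sketch below is
illustrative only; it uses C9X hexadecimal floating constants and
relies on fma() being correctly rounded.  It recovers the rounding
error of a double product, which is possible only because the multiply
inside fma() is carried to the full 106 bits before the single
rounding.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 1.0 + 0x1p-30;   /* 1 + 2^-30 */
    double h = 1.0 - 0x1p-30;   /* 1 - 2^-30; the exact product x*h is 1 - 2^-60 */

    double p = x * h;           /* product rounded to double: exactly 1.0 */
    double e = fma(x, h, -p);   /* the discarded low part: exactly -2^-60 */

    printf("p = %.17g\n", p);   /* 1 */
    printf("e = %.17g\n", e);   /* about -8.67e-19, i.e. -2^-60 */
    return 0;
}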

The Java spec, however, would seem to imply that the product (x * h) must
be explicitly rounded to double before it is added to base, leading to
the alternative sequence

; fr1 = base, fr2 = x, and fr3 = h.
; Compute fr0 = fr1 + (fr2 * fr3)
fmul	fr2,fr2,fr3	; fr2 = fr2 * fr3, replaces x -- EXTRA ROUND TO DOUBLE
fadd	fr0,fr1,fr2

Java has doubled the number of instructions, added a temporary value,
and forced one gratuitous rounding on the PowerPC.

Java Is Not Architecture Neutral

Although by their nature numerical issues can become painfully murky
for some, one thing is certain: these simple examples illustrate that Java
as currently specified is NOT architecture neutral.  Users of either
Pentium or PowerPC architectures will suffer a performance penalty
compared to Mips or Sparc users, whose machines perform only ordinary
operations and only on float and double values.

The notion of "architecture-neutral" has other interpretations.  Java's
designers might say that achieving the same results (or nearly so) on all
machines makes the language architecture neutral, but this paper argues below
that this goal is ultimately flawed and that its price -- lower performance
and less accurate results -- is too heavy a penalty to impose on users of
some prevalent architectures.

Java Does Not Achieve High Performance

The examples above show that Pentium and PowerPC users pay twice: less
accurate results because of imposed intermediate operations and lower
performance due to time wasted on the unwanted operations.  The ANSI
committee revising the C language standard has wrestled with these issues
for years.  It has settled on an approach that is truly architecture
neutral and high performance.  Consider this example:

/* This is C9X, but could as well be Java. */
#include <math.h>	/* declares double_t */
double sum, x[100], y[100];
double_t t;		/* fastest type at least as wide as double */
int i;

t = 0.0;
for (i = 0; i < 100; i++)
	t += x[i]*y[i];
sum = t;

Such loops dominate compute-intensive numerical applications.  C9X
gives explicit, portable access to system-dependent types that will
produce the best results fastest.  Pentiums make best use of their
stack.  PowerPCs make best use of fused multiply-add.  Both produce
results at least as good as -- in the case of Pentium usually rather
better than -- a vanilla float/double architecture processor.

This programming style leads to the best code across a diversity of
machines.  By contrast, the current Java spec would force every product
and sum out of the FPU stack into memory, significantly slowing the
loop and nullifying the advantages of extra range and precision.
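
How double_t adapts to the machine is easy to inspect.  The following
sketch relies only on the C9X definitions of FLT_EVAL_METHOD in
<float.h> and double_t in <math.h>; the values printed are, of course,
implementation dependent.

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* FLT_EVAL_METHOD 0: expressions evaluated in their nominal types,
       so double_t is double (typical of Sparc and Mips);
       FLT_EVAL_METHOD 2: expressions evaluated in long double,
       so double_t is long double (typical of the x87 stack). */
    printf("FLT_EVAL_METHOD  = %d\n", (int)FLT_EVAL_METHOD);
    printf("sizeof(double)   = %u\n", (unsigned)sizeof(double));
    printf("sizeof(double_t) = %u\n", (unsigned)sizeof(double_t));
    return 0;
}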

Java Floating Point Is Not Strictly IEEE Standard -- Unnecessarily

In its first incarnations, Java is an interpreted language.  Its code
is compiled down to compact "byte codes", which can in turn be sent
across a network and executed on diverse computers.  To have some
prospect of portability of algorithms and data, Java's designers chose
to adopt part of the IEEE standard 754 for binary floating point
arithmetic.

The standard ensures bits in a 32-bit float value and in a 64-bit
double value will have the same meaning on all Java systems (modulo
byte order, that is).  It also ensures that arithmetic will be performed
as if with unbounded range and precision, with results rounded to the
appropriate representable value.

Java deviates from the letter and the spirit of IEEE arithmetic in
critical ways.  It omits any state associated with floating point arithmetic,
preventing programmers from choosing a rounding direction from any of
the four (required) IEEE options, and preventing programmers from testing
and altering (required) IEEE sticky exception flags.
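
For contrast, here is a brief, illustrative sketch of the
corresponding C9X facilities in <fenv.h>.  A program can select any of
the four rounding directions and can test the sticky inexact flag
after the fact; Java exposes neither.

#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    volatile double one = 1.0, three = 3.0;  /* volatile defeats constant folding */

    feclearexcept(FE_ALL_EXCEPT);

    fesetround(FE_UPWARD);             /* one of the four required directions */
    double up = one / three;
    fesetround(FE_TONEAREST);          /* back to the default */
    double nearest = one / three;

    printf("rounded up:      %.17g\n", up);
    printf("rounded nearest: %.17g\n", nearest);

    if (fetestexcept(FE_INEXACT))      /* sticky flag raised by the divisions */
        printf("the inexact flag is set\n");
    return 0;
}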

The desire of Java's designers to attain identical results on all machines
is not at all in the spirit of the IEEE standard.  The standard actually
recommends -- with its careful use of the word "should" -- the implementation
of a double-extended format on float/double machines, yet Java's designers
would preclude its use.  The IEEE recommendation was based on years of
experience that showed how a little extra hardware could go a long way
computationally.  The recommended extensions of range and precision --
supported by the x86 and 68K architectures -- are the preferred, not
prohibited, mechanisms for expression evaluation.  Aside from their
obvious use to experts, they also defend less sophisticated programmers
who often transcribe numerically dubious mathematical formulas from texts.
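
A familiar illustration, not drawn from the note above: the textbook
hypotenuse formula sqrt(x*x + y*y) overflows in double for large x and
y even though the true result is well within range.  Assuming, as Java
specifies, that the squares are rounded to double, the naive version
fails; carried in an extended format, approximated below by long
double (genuinely wider only on x86-class machines), it survives.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 1e200, y = 1e200;

    /* Each square overflows the double format, so the naive result is infinity. */
    double naive = sqrt(x * x + y * y);

    /* The same formula with the squares held in a wider format. */
    double wide = (double)sqrtl((long double)x * x + (long double)y * y);

    printf("naive    = %g\n", naive);    /* inf */
    printf("extended = %g\n", wide);     /* about 1.414e200 */
    return 0;
}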

The examples above illustrate that Java prevents effective use of extended
precision and range on processors whose users have paid for those features.
In contrast, designers of IEEE-754 were deliberately architecture-neutral,
though it might be that "architecture-open" is a better expression.
The standard allows
* float-only systems, for the low end
* extended formats with extra range and precision, for users who care to
  pay for their benefits
* different implementations of underflow, in which the differences are
  computationally and logically negligible.

It's a pity that Java is so restrictive in light of progress in the new
C9X standard, in which great care has been taken to make the most powerful
features of the IEEE binary floating point standards portable at last.

Portability != Reproducibility

Java's designers might argue that their restrictive specification for
floating point arithmetic approaches reproducibility of results across
diverse architectures.  Identical floating point results on all machines
yield the ultimate portability.  But at what cost?

Truly identical results are possible only if every library function
available to applications yields identical results, which is only
practical if all implementations are the same.  Therein lies the
flaw.  As long as competition (or past practice) motivates a variety
of architectures, computations will be best done in different ways
on different machines.  Trying to enforce specific algorithms for
important computations across all machines will ultimately fail
because users (and salespeople) want to eke all the speed possible
out of a given architecture.

For a concrete case, consider the third example presented earlier --
computing an inner product of two vectors, that is

x[0]*y[0] + x[1]*y[1] + x[2]*y[2] + . . . + x[N-1]*y[N-1]
  
where x and y are double arrays.  Consider the possibilities for a
function that might be found in the widely used LINPACK or LAPACK
libraries.

1. The current Java spec suggests that all intermediate operations
   should produce double results.  Sparc and Mips processors work
   this way.

2. PowerPC users might expect fused multiply-add to speed the loop.

3. Pentium users would expect intermediate operations to enjoy
   extended range and precision.

4. Followers of the work of Kulisch and Miranker would demand that
   the sum be computed with JUST ONE rounding error -- a facility
   provided by the ACRITH package in some IBM hardware and other
   software.

Option 4 is the most accurate, but its cost is high given the range
of the IEEE double format.  Options 1-3 are fastest on their
respective architectures, with 3 almost certainly the most accurate
of them.
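
The accuracy gap between options 1 and 3 is easy to exhibit.  In this
hypothetical sketch the data are contrived so that every partial sum
rounded to double (option 1, and what Java would mandate, assuming the
compiler really does round each partial sum) cancels the answer away,
while an extended accumulator, played here by long double, keeps it.
On machines where long double is no wider than double, both lines
print 0.

#include <stdio.h>

int main(void)
{
    /* The exact inner product is 1e16*1 + 1*1 + (-1e16)*1 = 1. */
    double x[3] = { 1e16, 1.0, -1e16 };
    double y[3] = { 1.0, 1.0, 1.0 };

    double      sum1 = 0.0;   /* option 1: every partial sum rounded to double */
    long double sum3 = 0.0L;  /* option 3: partial sums kept in extended */
    int i;

    for (i = 0; i < 3; i++) {
        sum1 += x[i] * y[i];
        sum3 += (long double)x[i] * y[i];
    }

    printf("double accumulation:   %g\n", sum1);           /* 0 -- the 1 is lost */
    printf("extended accumulation: %g\n", (double)sum3);   /* 1 */
    return 0;
}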

Who will decide what's right, and how much users must pay to get this
right result?

The question is even more subtle in the realm of transcendental
functions like cos() or exp(), some of which require argument
reduction, too.  Not even Prof. W. Kahan of U. C. Berkeley,
who received the ACM Turing Award for his work on the IEEE standard
and for his numerical nit-pickiness, would want to sit on such a
Numerical Governing Board.

And if the issue isn't sufficiently intractable mathematically,
consider the forces of the marketplace, in which faster code,
often at the expense of some accuracy but quite often without
sacrificing key properties like monotonicity, can create its
own market.

Identical results from any but the simplest expressions are
a hopeless cause.  An attempt at such a rigid standard could lead to a
specification that only one implementation -- the one blessed with
choosing the "right" result of any computation -- can conform to.

Java Can Have Robust, Portable Floating Point Computation

As specified today, Java makes numerical sense only on processors
like Sun's Sparc that support only basic arithmetic on float and
double operands; users of other processors suffer one form of
degradation or another.

Ultimately, one hopes users will strive for a correct answer,
not the same answer.  In over a decade since its adoption, IEEE-754
has driven the industry toward common data formats and higher
quality results.  Over the past seven years, work on the new
C9X standard has exposed techniques to make the features of IEEE
arithmetic available portably.  It's time for the Java community
to take note of that progress, to open the standard to the real
diversity of existing, standard floating point architectures, and
achieve a truly high performance, portable design that is
architecture neutral in a numerically sensible way.

Jerome Coonen
C-Labs
4035 Orme St.
Palo Alto, CA  94306
jeromeabe.com

Background

Jerome T. Coonen received his Ph.D. in mathematics from U. C. Berkeley
for work with W. Kahan on the design and early implementation of the
IEEE binary floating point standard.  He worked at Apple Computer
for a decade and nowadays consults on floating point issues with
Exponential Technology, Metrowerks, Apple, Microsoft, and SunSoft.



