Correctly-rounded base conversion effects on code size and speed

Tue Jun 12 09:57:52 PDT 1990

To quantify the results of my base conversion work,
I prepared the following report for internal Sun distribution.
Since the code in question will probably be publicly available
after I put it into a form that is mostly
portable among IEEE 754 implementations and less so among others,
the report seems suitable for distribution to this mailing list.
Note that the results for SunOS 4.1, Fortran 1.[34], and C 1.[01]
were obtained with early internal releases that may differ from
what was finally shipped.

---------------------------------------------------------------------------

This report describes recent improvements in floating-point base conversion
performance in Sun software. While base conversion does not dominate most
programs, nearly all application codes are affected by accuracy and performance
of base conversion.  Increasing the accuracy of base conversion reduces concerns
about getting "different" answers on different Sun platforms (thus reducing
support problems) and moves closer to the vision of distributed computing.
Increasing the performance of base conversion reduces complaints about poor
performance in general.  Users do not often think about base conversion, but
they do notice when it is wrong, or slow.

Who Cares About Base Conversion?

     When you want to read in a number in conventional
decimal notation and store it in internal binary format, or
when conversely you want to print an internal binary value
as an ASCII string of decimal digits, you need to do a
conversion between number representations in bases two and
ten.  Base conversion is not too difficult for integer for-
mats but gets more complicated when floating point is
involved; as with other floating-point operations, rounding
errors occur.  For instance, on conversion from internal
floating point to an external decimal representation, the
mathematical problem to be solved in default rounding mode
is

Given integers i and e, find integer(s)

        (for printf %f or Fortran F format with n digits after point)

                I such that

                        | i * 2**e * 10**n - I | <= 0.5

        (for printf %e or Fortran E format with n significant digits)

                I and E such that

                        | i * 2**e - I * 10**E | <= 0.5 * 10**E

                AND
                         10**(n-1) <= I <= (10**n) - 1

        always distinguishing exact conversions
        always rounding inexact halfway cases to nearest even

Of course the integers I may not fit in a conventional C int
or long.

     Since multiple operations occur, so do multiple round-
ing errors if these problems are solved in a straightforward
way using floating-point arithmetic.  These rounding errors
obscure the floating-point value you're trying to display.
Worse, different common algorithms have different rounding
errors so the same program with the same data produces dif-
ferent results on different systems, even for different
operating systems on the same hardware. This is a problem
for network computing:

     Persons doing scientific and engineering calculations
aren't comfortable with network computing that gets dif-
ferent results on different systems, especially if all sys-
tems nominally conform to IEEE 754.  IEEE 754 requires
correct rounding for normal numbers, whose magnitudes range
       -27      +27
from 10    to 10   , but permits slightly incorrect rounding
for larger exponents.  Even so most implementations (like
SunOS 3.5) get it almost right, but not quite, and so get
different results, even for normal exponents. The work
reported below is intended to eliminate this particular
source of gratuitous differences.  Although some customers
complained when upgrading from 3.5 to 4.0 caused "different"
answers to appear for the same computation, I think in the
long term they will be happier knowing that the answers
won't change again because now they are correct.

     The only reasonable uniformity goal is correct round-
ing, which is to say that each conversion operation suffers
at most one rounding error which acts just like rounding
errors for correctly-implemented IEEE multiplication or
division.  Although the basic ideas of correctly-rounded
base conversion and of efficient base conversion are well
documented in Coonen's thesis and Sterbenz' book, fast
correctly-rounded implementations don't seem to be common.
To be fully effective, therefore, the machine-independent
parts of the resulting source code eventually should be
available to other vendors in the same way that NFS was pro-
pagated.  Thus Sun will draw a target around our bullet hole
and invite others to hit it.

     Other kinds of discrepancies that inhibit network com-
puting include extended precision expression evaluation and
elementary transcendental functions.  So there are plenty of
interesting projects here for the future.

Who Cares About Base Conversion Performance?

     Nobody, I thought - surely the work of computing the
answer and printing it greatly exceeds the work of format-
ting it for printing.  But spreadsheet users and students in
elementary Fortran classes often spend more time in conver-
sion from internal binary floating-point form to ASCII
strings for output than anywhere else.  Even some customers
with presumably more sophisticated applications noticed a
performance degradation in SunOS 4.0; the inner loop of one
customer application consists of formatted Fortran write
statements to intermediate files that they process on other
systems.  They couldn't use unformatted files because those
other systems were not IEEE systems.

SunOS Implementations

     Conversion between floating-point binary formats and
ASCII strings has been extensively revised in recent SunOS
releases, first to obtain "correctly rounded" conversions
fully observing IEEE rounding modes and detecting IEEE
exceptions, and then to improve performance.

     Prior to SunOS 3.0, conversion between ASCII strings
and IEEE single or double precision was accomplished by code
written in portable C which used floating-point arithmetic
and very slow, inaccurate algorithms.  Since the available
floating-point hardware was the Sky FFP and the default code
generation was -fswitch, the results of your conversions
depended on whether a Sky board was installed.

     In SunOS 3.x, Richard Tuck vastly improved speed and
accuracy by using extended precision multiplication and
tables of positive and negative powers of ten, all coded
using mc68010 assembler.  The Sun-4 project forced me to
develop a much more portable approach to base conversion.

*    The mc68010 software extended-precision multiplication
     method failed to meet IEEE accuracy requirements
     because tables of negative powers of ten were used
     rather than an extended precision division subroutine.

*    A major rewrite would have been required anyway for
     such an approach to convert to and from 68881 extended
     format or quadruple precision format.

*    Rounding modes and exceptions were ignored.

*    Different input scanners for floating-point numbers
     were used in C and Fortran; neither was able to read
     infinities or NaNs output by printf or write state-
     ments.

*    The native floating-point arithmetic hardware was not
     used in floating-point base conversion because of
     unreliability in the Sky FFP and A79J 68881.

     In sys4-3.2 and SunOS 4.0, I implemented correctly-
rounded conversion - subject to a number of bugs discovered
later and subsequently fixed - by computing powers of ten
exactly, using multiple precision integer arithmetic, as
needed at run time. I added conversions to and from 68881
extended-precision floating-point format.  Rounding modes
and exceptions were handled correctly.  Performance was very
slow on numbers with large exponents, however, due to the
time required to compute large powers of ten.  I developed a
unified scanner for Fortran and C that understands "Infin-
ity" and "NaN", so that all bugs would be shared (and there-
fore discovered sooner).  Some people observed that signifi-
cant stack space was consumed because the largest buffers
that might be needed were always allocated on the stack -
                                     +5000
for extended exponents as large as 10-    .

     In SunOS 4.1 beta, tables of selected small and large
powers of ten, expressed as variable-precision integers,
speed up conversion for large exponents.  This led to
remarkable performance improvements in those cases, and
remarkable increases in size of staticly-linked executables.
In anticipation of an SVR4 ANSI-C compiler with a long dou-
ble type, quadruple-precision base conversion was added.
Large buffers were malloc'd and free'd as needed at run time
rather than allocated on the stack with a fixed size.
Unfortunately 4.1 froze before I had time to add performance
tuning enhancements like internal inline expansion tem-
plates.

     For SPARCompilers 0.0 beta (Fortran 1.3 and C 1.0), I
focussed on performance of base conversion for single- and
double-precision floating-point numbers with normal
exponents.  For instance, inline expansion templates were
used internally to speed up the inner loops in the
multiple-precision integer arithmetic.

     For SPARCompilers 1.0 developmental (Fortran 1.4 and C
1.1), I decided to exploit floating-point hardware when
available to handle the easy cases that didn't require
multiple-precision integer arithmetic to round correctly.
This means that base-conversion speed now is a function of
code generation option.  I reorganized the main tables of
powers of ten to consume about half as much space, at the
cost of about twice the performance on large exponents.
Other new tables also increased executable size and improved
performance by lesser amounts.

Test Programs

     In the following list of test programs, "normal"
floating-point numbers have magnitudes ranging roughly from
  -27      +27
10 -  to 10    , while "large" floating-point numbers have
magnitudes uniformly selected from all those representable
in a floating-point format.

SP.timefio - /valid/timefio -DSP
        100000 Fortran list-directed output normal floating-point single-precision
        100000 Fortran list-directed input normal floating-point single-precision
        100000 Fortran E-formatted output large floating-point single-precision
        100000 Fortran E-formatted input large floating-point single-precision
        100000 Fortran integer I-formatted output
        100000 Fortran integer I-formatted input
DP.timefio - /valid/timefio -DDP
        100000 Fortran list-directed output normal floating-point double-precision
        100000 Fortran list-directed input normal floating-point double-precision
        100000 Fortran E-formatted output large floating-point double-precision
        100000 Fortran E-formatted input large floating-point double-precision
        100000 Fortran integer list-directed output
        100000 Fortran integer list-directed input
timestdio - /valid/timestdio
        100000 stdio sprintf normal floating-point double-precision
        100000 stdio sscanf normal floating-point double-precision
        100000 stdio sprintf large floating-point double-precision
        100000 stdio sscan large floating-point double-precision
        100000 stdio sprintf integer
        100000 stdio sscan integer
timextended - /valid/timextended
        10000 decimal to extended (base.4.0 and later)
                     -  -
        10000 extended to decimal (base.4.0 and later)
                      -  -
timequad - /valid/timequad
        10000 stdio sscanf long double (quadruple-precision) (base.4.1 and later)
        10000 stdio sprintf long double (quadruple-precision) (base.4.1 and later)

In addition, two short programs were linked staticly and the
code+data sizes measured in pages of 8192 bytes.  hello.c
and filter.f:

        main(){printf(" hello \n");}

        read *,x
        print *,x
        end

Test Environments

     Test programs were compiled and executed under the fol-
lowing environments:

base.1.4
     SunOS 1.4 bundled C and Fortran compilers, executables
     run on Sirius with SunOS 4.0 Thanks to Mitch Bradley
     who compiled the test programs on his home machine,
     Sun-2/50 #7.

base.3.5
     SunOS 3.5, bundled Fortran, Sirius

base.4.0
     SunOS 4.0, Fortran 1.1, Sirius or Sunrise

base.4.1
     SunOS 4.0, SunOS 4.1 beta /lib/libc.a, Fortran 1.2,
     Sirius or Sunrise

base.0.0
     SunOS 4.0, SPARCompilers 0.0 beta, Sirius or Sunrise,
     -u   doprnt -u   doscan
        --          --

base.1.0
     SunOS 4.0, SPARCompilers 1.0 developmental, Sirius
     -ffpa, Sirius -f68881, or Sunrise FPU2 -cg89 -u
       doprnt -u   doscan
     --          --

     The following tables of Sun-3/4 results show execu-
tion times in seconds for these test programs.

Sun-3/260 Results

     The following results are from a sirius with an
FPA.  Times in seconds, code+data sizes in pages.

                 -fsoft      -f68881     -ffpa
Test             base.1.0    base.1.0    base.1.0    base.0.0    base.4.1    base.4.0      base.3.5   base.1.4

hello.c
code+data            6+1         7+1         8+1         8+1         4+6           2+1      2+1         2+1
filter.f
code+data           13+2        13+2        15+2        14+2         9+6           8+1      9+1         5+1

SP.timefio
write normal          63          57          49          68          96            66       41         253
read normal           67          50          42          77          96            70       40         255
write big             72          65          55          76         114            74       38         375
read big              73          54          59          85         105            85       30         290
write int             35          35          35          35          37            32       29          61
read int              29          29          29          29          30            26       25          46

DP.timefio
write normal          76          69          59          86         127            91       56         346
read normal           93          70          69         113         148           102       47         377
write big            107         113         110          94         154           451       51        2000
read big             113         119         124         109         141           453       37        1200
write int             31          30          30          30          32            32       29          62
read int              48          39          36          53          58            62       33         136

timestdio
sprintf normal        53          50          38          60          95            69       31         175
sscanf normal         70          36          33          83         112            68       32         108
sprintf big           83          92          86          69         121           388       34        2000
sscanf big            96          97          94          86         118           346       33         112
sprintf int           11.6        12.2        12.1        12.6        13.3          13.4     12.4        20.3
sscanf int            13.0        13.2        13.0        12.7        19.7          13.7     12.3        13.7

timextended
dec to bin       11.1-23.0   11.4-23.3   11.3-23.2    5.5-10.6    7.8-19.7   1700-3400        X          X
bin to dec       13.7-25.6   13.7-25.6   13.6-25.6    9.8-12.1   20.7-28.5   2400-3200        X          X

timequad
sscanf           11.9-25.2   12.1-25.5   12.3-25.4    6.3-12.6    9.9-23.9        X           X          X
sprintf          16.2-30.9   16.4-31.0   16.2-31.0   10.6-16.6   23.1-41.1        X           X          X

Sun-4/260 Results

     The following results are from a sunrise with
FPU2.

                 -cg89
Test             base.1.0    base.0.0    base.4.1    base.4.0

hello.c
code+data            9+1        10+1         6+6            3+2
filter.f
code+data           17+2        18+2        13+6           10+1

SP.timefio
write normal          23          44          59             43
read normal           17          40          46             50
write big             25          51          67             59
read big              25          45          49             81
write int             25          24          27             25
read int              14          14          15             13

DP.timefio
write normal          30          60          83             61
read normal           35          61          71             88
write big             72          71          96           1077
read big              74          65          67           1720
write int             23          25          23             23
read int              15          27          27             26

timestdio
sprintf normal        19          39          52             38
sscanf normal         14          44          53             39
sprintf big           56          51          73           1070
sscanf big            58          47          57           1191
sprintf int            7.4         9.3         9.5            9.3
sscanf int             6.1         5.9         9.2            5.9

timextended
dec to bin        8.4-27.3     3.7-8.4    4.0-11.3   2800-21000
bin to dec       10.6-26.0    9.9-12.3   11.4-17.5   3300-20000

timequad
sscanf            8.2-23.1     4.3-9.9    4.8-13.4        X
sprintf          12.6-31.8   12.6-27.3   10.6-16.6        X