experiments with caching in pow(x,y)

Sat Apr 2 15:28:25 PST 1994

Anybody interested in tbl | troff -ms source for the following should send
me email after April 11.

It appears to be worthwhile to cache a few values of log2(x) in pow(x,y), and
to precompute log2{1,2,10}.

Introduction

     In order to determine if there were machine-independent ways to improve
execution of x**y, I decided to determine whether caching values of log2(x)
would be worthwhile within pow(x,y).   I expect that aggressively optimizing
compilers will consider simple functions like exp(x) or sin(x) to be constants
when x is invariant in a loop, and I would expect that programmers attempting
to optimize by hand would remove such computations from loops in the source. I
would not expect a compiler to recognize optimization possibilities in
289.0d0**y, though a programmer might if he knew y were constant.   A program-
mer should be cautious about optimizing 289.0d0**y as exp(y*log(289.0d0))
since this will entail accuracy loss compared to a carefully-implemented pow
function such as that in fdlibm.

     x**y and pow(x,y) are typically implemented, in principle, as
pow(2.0d0,y*log2(x)), although the devil is in the details if error bounds
close to one ulp are intended in extreme cases - extra precision must be used
explicitly or implicitly.    The computation of log2(x) in higher precision is
at least half of the cost of pow, so caching values could be worth investigat-
ing.

     I experimented with a number of programs from SPEC, PERFECT, and else-
where, all originally at least realistic applications.   I compiled Fortran
with f2c so that all references to x**y in single or double precision were
translated to calls to pow(x,y).    Using SunOS 5.3 /usr/lib/libm.a, based on
fdlibm, I was able to substitute a slightly rewritten pow(x,y) wrapper, as
listed at the end of this report, to track calls referring to log(1), log(2),
or log(10), as well as to implement a 7-deep cache of most recently used x
values. The compiler was SunPro's current "SC3.0" release.

     Unfortunately this approach is not applicable to SunPro's current multi-
thread-safe libraries.    The cost of locking or unlocking a shared cache, or
of using per-thread storage for caching, would probably eliminate any savings.
More generally, even in single threads, testing for n cached or precomputed
values might encompass log2(n) compares in the main path, which would slow
down all pow computations.  The net results are thus in doubt and warrant
further investigation.

Results

     spice3e2: Most spice3e2 input decks exercise pow but little; but those
that do tend to compute pow(10.0,y) exclusively.   The test decks pz2, pzt,
and simplepz (poles and zeros) exercise pow(2.0,y) primarily.  Thus it appears
worthwhile to include 2 and 10 among the special pre-computed values of log2.

     plume:  Plume is a Sandia benchmark program which is unfortunately not
distributable, which achieves a 68% hit ratio with a 1-deep cache and a 75%
hit ratio with a 2-deep cache.  The seven-deep cache hit 600 thousand out of
700 thousand calls, within a total program execution time of 31 seconds.  The
popular values of x-1 that hit the cache appear to be:

                               hits      x-1

                             120130      999
                               9730      269.881
                               7130      269.867
                               7110      269.88
                               6870      269.887
                               6420      269.891
                               6330      269.875
                               5900      269.886
                               5870      269.869
                               5360      269.882

x-1 was printed rather than x in order to finely differentiate x values close
to 1.  The physical significance of the function 1000.0**y would have to be
derived from the source code, I imagine.

     herwig: herwig,"Hadron Emission Reactions With Interfering Gluons" is a
Monte Carlo code by Marchesini, Knowles, and Webber, whose distributability
status is unknown, though it may be available from CERN.  The seven-deep cache
hit 12 million out of 16 million calls, over 720 seconds.  A one-deep cache
achieves a 68% hit ratio, and a two-deep 73%.  The popular values of x-1 that
hit the cache appear to be:

                             hits     x-1

                           3880         1.52573
                           1770        -0.999277
                           1320        -0.999265
                           1050        -0.999278
                            860        -0.000722229
                            710        -0.999072
                            610        -0.999275
                            470        -0.999276
                            450        -0.999273
                            410        -0.999251

     perfect.adm: adm achieved 59% with a one-deep and 64% with a two-deep
cache.  The seven-deep cache hit 400 thousand out of 600 thousand calls, over
100 seconds.  The popular values of x-1 that hit the cache appear to be:

                              hits      x-1

                              75        173.218
                              42        173.918
                              40        174.373
                              33        175.288
                              31        176.629
                              28        175.49
                              19        189.99
                              19        188.464
                              19        177.614
                              18        189.456

     perfect.arc2d: arc2d was called 43% of the time with x==1, and the rest
of the time with x within a few ulps of 1.  Detecting 1 as a special case is
already done in fdlibm pow; beyond that, a one-deep cache hits 7% and a two-
deep 10%.  The seven-deep cache hit 30 thousand out of 60 thousand calls, over
590 seconds.  None of the other PERFECT Club programs had significant hits.
The popular values of x-1 that hit the cache appear to be:

                           hits        x-1

                          23780          0
                           3070          2.22045e-16
                           1300         -1.11022e-16
                           1160         -4.44089e-16
                            120          4.44089e-16
                             88         -3.33067e-16
                             86         -6.66134e-16
                             25         -1.22125e-15
                             19         -2.22045e-15
                             15         -2.88658e-15

This program deserves further investigation; why do people compute 1**y, or
(1+eps)**y, repeatedly ?

     jetset74: jetset, "The Lund Monte Carlo for Jet Fragmentation and e+e-
Physics" by Sjostrand and Bengtsson, is available from Sjostrand and CERN.  A
one-deep cache hits 5%, a two-deep 43%, and a four-deep 80%.  The seven-deep
cache hit 2.9 million out of 3.5 million calls, over 210 seconds.  The popular
values of x-1 that hit the cache appear to be:

                            hits       x-1

                            20170       -0.998556
                             6680       -0.999999
                             6030        9
                             5840       -0.778297
                             5520       -0.999996
                             4840       -1
                             4810       -0.999961
                             4660       -0.88121
                             4590       -0.586227
                             4530       -0.727867

     isajet: isajet, "A Monte Carlo Event Generator" by Paige et al, is avail-
able for academic purposes from Paige at Brookhaven.  A one-deep cache hits
27%, a two-deep 45%, four-deep 71%.  The seven-deep cache hit 1.0 million out
of 1.3 million calls, over 100 seconds.  The popular values of x-1 that hit
the cache appear to be:

                            hits      x-1

                            46         -0.0501198
                            42         -0.050005
                            40         -0.050205
                            34         -0.0383852
                            26         -0.0541548
                            23         -0.0503193
                            22         -0.0582043
                            21         -0.186438
                            21         -0.0583095
                            21         -0.0502395

     rayshade406: rayshade, a computer graphics program in C, is available
from Yale.   Results vary according to input file, but one test drawing,
"pool", hit 19% with a one-deep cache, 330 thousand times out of 1.8 million
calls, over 350 seconds.   A different drawing, "boxball", hit 14%.  Typical
hits:

                              hits     x-1

                            11          -0.0792645
                             7          -0.0792648
                             6          -0.0792658
                             6          -0.0792657
                             6          -0.0792651
                             6          -0.0792646
                             5          -0.0792676
                             5          -0.0792653
                             5          -0.079265
                             5          -0.0792647

     015.doduc: doduc, the nuclear reactor simulator, is the only SPEC program
with significant hits: A one-deep cache hits 18%, a two-deep 34%.  The seven-
deep cache hit 150 thousand out of 460 thousand calls in 54 seconds total pro-
gram execution time.  The popular values of x-1 that hit the cache appear to
be:

                             hits        x-1

                           10100          -1
                               5           5.22095
                               5           5.22093
                               5           5.21187
                               5           5.12077
                               5           5.01914
                               5           4.77667
                               5           4.39474
                               4           5.36838
                               4           5.21252

One would certainly wonder why 0**y is so popular to compute.

Modified w pow.c from fdlibm

     This is the modified w pow.c used to cache x values.   It prints a line
of output for every cache hit up to 100, then every tenth cache hit after that
- some test programs (isajet and jetset) were generating many megabytes of
cache hit data.

#include "fdlibm.h"
#include <stdio.h>

#define CACHESIZE 10
#define FIXEDCACHESIZE 3

double cache[CACHESIZE] = { 1.0, 2.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 } ;
long    hits[CACHESIZE] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 } ;
long    misses          = 0 ;

        double pow(double x, double y)  /* wrapper pow */
{
int icache, j;

for (icache=0 ; icache<CACHESIZE ; icache++)
        if (x == cache[icache]) break ;
if (x == cache[icache])
        { /* hit */
        if (icache >= FIXEDCACHESIZE) {
                for (j=icache; j>FIXEDCACHESIZE; j--) cache[j]=cache[j-1];
                cache[FIXEDCACHESIZE] = x ;
        }
        hits[icache] ++ ;
        fflush(stderr);
        if ((hits[icache] < 100) || ((hits[icache] % 10) == 0))
                printf("0ow %d %g ",10*misses+icache,x-1);
        fflush(stdout);
}
else
        { /* miss */
        for (j=CACHESIZE-1; j>FIXEDCACHESIZE; j--) cache[j]=cache[j-1];
        cache[FIXEDCACHESIZE]=x;
        misses++;
}

        return    ieee754 pow(x,y);
}