experiments with caching in pow(x,y)
David G. Hough at validgh
dgh
Sat Apr 2 15:28:25 PST 1994
Anybody interested in tbl | troff -ms source for the following should send
me email after April 11.
It appears to be worthwhile to cache a few values of log2(x) in pow(x,y), and
to precompute log2{1,2,10}.
Introduction
In order to determine if there were machine-independent ways to improve
execution of x**y, I decided to determine whether caching values of log2(x)
would be worthwhile within pow(x,y). I expect that aggressively optimizing
compilers will consider simple functions like exp(x) or sin(x) to be constants
when x is invariant in a loop, and I would expect that programmers attempting
to optimize by hand would remove such computations from loops in the source. I
would not expect a compiler to recognize optimization possibilities in
289.0d0**y, though a programmer might if he knew y were constant. A program-
mer should be cautious about optimizing 289.0d0**y as exp(y*log(289.0d0))
since this will entail accuracy loss compared to a carefully-implemented pow
function such as that in fdlibm.
x**y and pow(x,y) are typically implemented, in principle, as
pow(2.0d0,y*log2(x)), although the devil is in the details if error bounds
close to one ulp are intended in extreme cases - extra precision must be used
explicitly or implicitly. The computation of log2(x) in higher precision is
at least half of the cost of pow, so caching values could be worth investigat-
ing.
I experimented with a number of programs from SPEC, PERFECT, and else-
where, all originally at least realistic applications. I compiled Fortran
with f2c so that all references to x**y in single or double precision were
translated to calls to pow(x,y). Using SunOS 5.3 /usr/lib/libm.a, based on
fdlibm, I was able to substitute a slightly rewritten pow(x,y) wrapper, as
listed at the end of this report, to track calls referring to log(1), log(2),
or log(10), as well as to implement a 7-deep cache of most recently used x
values. The compiler was SunPro's current "SC3.0" release.
Unfortunately this approach is not applicable to SunPro's current multi-
thread-safe libraries. The cost of locking or unlocking a shared cache, or
of using per-thread storage for caching, would probably eliminate any savings.
More generally, even in single threads, testing for n cached or precomputed
values might encompass log2(n) compares in the main path, which would slow
down all pow computations. The net results are thus in doubt and warrant
further investigation.
Results
spice3e2: Most spice3e2 input decks exercise pow but little; but those
that do tend to compute pow(10.0,y) exclusively. The test decks pz2, pzt,
and simplepz (poles and zeros) exercise pow(2.0,y) primarily. Thus it appears
worthwhile to include 2 and 10 among the special pre-computed values of log2.
plume: Plume is a Sandia benchmark program which is unfortunately not
distributable, which achieves a 68% hit ratio with a 1-deep cache and a 75%
hit ratio with a 2-deep cache. The seven-deep cache hit 600 thousand out of
700 thousand calls, within a total program execution time of 31 seconds. The
popular values of x-1 that hit the cache appear to be:
hits x-1
120130 999
9730 269.881
7130 269.867
7110 269.88
6870 269.887
6420 269.891
6330 269.875
5900 269.886
5870 269.869
5360 269.882
x-1 was printed rather than x in order to finely differentiate x values close
to 1. The physical significance of the function 1000.0**y would have to be
derived from the source code, I imagine.
herwig: herwig,"Hadron Emission Reactions With Interfering Gluons" is a
Monte Carlo code by Marchesini, Knowles, and Webber, whose distributability
status is unknown, though it may be available from CERN. The seven-deep cache
hit 12 million out of 16 million calls, over 720 seconds. A one-deep cache
achieves a 68% hit ratio, and a two-deep 73%. The popular values of x-1 that
hit the cache appear to be:
hits x-1
3880 1.52573
1770 -0.999277
1320 -0.999265
1050 -0.999278
860 -0.000722229
710 -0.999072
610 -0.999275
470 -0.999276
450 -0.999273
410 -0.999251
perfect.adm: adm achieved 59% with a one-deep and 64% with a two-deep
cache. The seven-deep cache hit 400 thousand out of 600 thousand calls, over
100 seconds. The popular values of x-1 that hit the cache appear to be:
hits x-1
75 173.218
42 173.918
40 174.373
33 175.288
31 176.629
28 175.49
19 189.99
19 188.464
19 177.614
18 189.456
perfect.arc2d: arc2d was called 43% of the time with x==1, and the rest
of the time with x within a few ulps of 1. Detecting 1 as a special case is
already done in fdlibm pow; beyond that, a one-deep cache hits 7% and a two-
deep 10%. The seven-deep cache hit 30 thousand out of 60 thousand calls, over
590 seconds. None of the other PERFECT Club programs had significant hits.
The popular values of x-1 that hit the cache appear to be:
hits x-1
23780 0
3070 2.22045e-16
1300 -1.11022e-16
1160 -4.44089e-16
120 4.44089e-16
88 -3.33067e-16
86 -6.66134e-16
25 -1.22125e-15
19 -2.22045e-15
15 -2.88658e-15
This program deserves further investigation; why do people compute 1**y, or
(1+eps)**y, repeatedly ?
jetset74: jetset, "The Lund Monte Carlo for Jet Fragmentation and e+e-
Physics" by Sjostrand and Bengtsson, is available from Sjostrand and CERN. A
one-deep cache hits 5%, a two-deep 43%, and a four-deep 80%. The seven-deep
cache hit 2.9 million out of 3.5 million calls, over 210 seconds. The popular
values of x-1 that hit the cache appear to be:
hits x-1
20170 -0.998556
6680 -0.999999
6030 9
5840 -0.778297
5520 -0.999996
4840 -1
4810 -0.999961
4660 -0.88121
4590 -0.586227
4530 -0.727867
isajet: isajet, "A Monte Carlo Event Generator" by Paige et al, is avail-
able for academic purposes from Paige at Brookhaven. A one-deep cache hits
27%, a two-deep 45%, four-deep 71%. The seven-deep cache hit 1.0 million out
of 1.3 million calls, over 100 seconds. The popular values of x-1 that hit
the cache appear to be:
hits x-1
46 -0.0501198
42 -0.050005
40 -0.050205
34 -0.0383852
26 -0.0541548
23 -0.0503193
22 -0.0582043
21 -0.186438
21 -0.0583095
21 -0.0502395
rayshade406: rayshade, a computer graphics program in C, is available
from Yale. Results vary according to input file, but one test drawing,
"pool", hit 19% with a one-deep cache, 330 thousand times out of 1.8 million
calls, over 350 seconds. A different drawing, "boxball", hit 14%. Typical
hits:
hits x-1
11 -0.0792645
7 -0.0792648
6 -0.0792658
6 -0.0792657
6 -0.0792651
6 -0.0792646
5 -0.0792676
5 -0.0792653
5 -0.079265
5 -0.0792647
015.doduc: doduc, the nuclear reactor simulator, is the only SPEC program
with significant hits: A one-deep cache hits 18%, a two-deep 34%. The seven-
deep cache hit 150 thousand out of 460 thousand calls in 54 seconds total pro-
gram execution time. The popular values of x-1 that hit the cache appear to
be:
hits x-1
10100 -1
5 5.22095
5 5.22093
5 5.21187
5 5.12077
5 5.01914
5 4.77667
5 4.39474
4 5.36838
4 5.21252
One would certainly wonder why 0**y is so popular to compute.
Modified w pow.c from fdlibm
This is the modified w pow.c used to cache x values. It prints a line
of output for every cache hit up to 100, then every tenth cache hit after that
- some test programs (isajet and jetset) were generating many megabytes of
cache hit data.
#include "fdlibm.h"
#include <stdio.h>
#define CACHESIZE 10
#define FIXEDCACHESIZE 3
double cache[CACHESIZE] = { 1.0, 2.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 } ;
long hits[CACHESIZE] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 } ;
long misses = 0 ;
double pow(double x, double y) /* wrapper pow */
{
int icache, j;
for (icache=0 ; icache<CACHESIZE ; icache++)
if (x == cache[icache]) break ;
if (x == cache[icache])
{ /* hit */
if (icache >= FIXEDCACHESIZE) {
for (j=icache; j>FIXEDCACHESIZE; j--) cache[j]=cache[j-1];
cache[FIXEDCACHESIZE] = x ;
}
hits[icache] ++ ;
fflush(stderr);
if ((hits[icache] < 100) || ((hits[icache] % 10) == 0))
printf("0ow %d %g ",10*misses+icache,x-1);
fflush(stdout);
}
else
{ /* miss */
for (j=CACHESIZE-1; j>FIXEDCACHESIZE; j--) cache[j]=cache[j-1];
cache[FIXEDCACHESIZE]=x;
misses++;
}
return ieee754 pow(x,y);
}
More information about the Numeric-interest
mailing list