Effects of Pentium Division Flaw and its Software Workaround
David G. Hough at validgh
dgh
Wed Feb 22 09:24:47 PST 1995
tbl | troff -ms source for the following, which permits prettier output
and includes the diff listings at the end that I omitted here to save space,
is available from dgh@validgh.com.
The infamous Intel Pentium floating-point division flaw is seldom visible in
the results of realistic technical applications, nor does it perceptibly
affect performance. But some programs skillfully designed to look for
arithmetic flaws can find it. In contrast, the expected harmless differences
between 486 and Pentium elementary transcendental functions, due to the
improved approximations in the latter, are often evident in ordinary
applications.
Intel and Cygnus published a recommended compiler workaround that reduces
the effect of the Pentium division flaw to at most one unit per division, in
the least significant bit of extended precision. Intel has also modified its
math library product libm.a to avoid the division flaw. The compiler and libm
workarounds do not affect results of CPU chips other than flawed Pentium
chips.
The two modifications to the compiler and libm avoid any severe effects
of the Pentium flaw, but sometimes cause harmlessly different results in
realistic technical applications. The modifications degrade performance of
flawed CPU's by a median of 1%, and SPECfp92 ratios by about 9%.
Scope of Report
This report describes the results of an investigation into the Intel
Pentium CPU division flaw and Intel's compiler and library workarounds for it.
The Intel Pentium CPU division flaw is caused by an incomplete table in the
division hardware of all Pentium CPU's shipped prior to late 1994. On rare
occasions it causes incorrect results on floating-point divisions, and even
less frequently on remainder, tangent, and arctangent operations. The
relative error can very rarely be as large as several parts in 10**5.
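A widely circulated check for the flaw, which is not one of the tests used in this report, divides 4195835 by 3145727 and multiplies back. The sketch below assumes IEEE double arithmetic. The flawed quotient is the commonly published value, quoted here as an assumption; it cannot be reproduced on correct hardware.

```python
# Widely circulated division check; these operands are not from this report.
X = 4195835.0
Y = 3145727.0

# Quotient commonly published for flawed Pentiums (an assumption here).
PUBLISHED_FLAWED_QUOTIENT = 1.333739068902037589


def residual(x, y):
    """Return x - (x / y) * y; for the pair above a correct divider gives 0."""
    return x - (x / y) * y


def relative_error(approx, exact):
    """Relative error of approx with respect to exact."""
    return abs(approx - exact) / abs(exact)


if __name__ == "__main__":
    print("residual:", residual(X, Y))
    print("relative error of published flawed quotient:",
          relative_error(PUBLISHED_FLAWED_QUOTIENT, X / Y))
```

The relative error of the published flawed quotient comes out at several parts in 10**5, in line with the worst case described above.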
In the context of technical computing applications, this report addresses
the questions:
Correctness:
How do the Pentium flaw and its workarounds affect program correctness
compared to an unflawed Pentium with no workarounds?
Performance:
How do the Pentium flaw and its workarounds affect program performance
compared to an unflawed Pentium with no workarounds?
Summary
effect of Pentium flaw with unmodified software
Running the same executables, produced by unmodified compilers and
libraries, on flawed and unflawed Pentiums revealed no performance
difference and no difference in output for any application program
tested. The flaw was manifest only in programs specifically devised to
carefully test floating-point division and elementary transcendental
functions, especially the UCBTEST programs devised by Prof. W. Kahan and
his students. Conclusion: The Pentium flaw will very rarely affect the
results of typical scientific applications. These programs perform many
floating-point operations, but perhaps relatively few of those operations
are likely to lead to incorrect decisions as a result of the Pentium
flaw. The flaw may occur much more often than misleading output appears,
to the extent that subsequent calculations obscure its effects. In
contrast, typical commercial spreadsheet users perform fewer
floating-point operations, but an occurrence of the flaw may have a higher
probability of affecting the visible output, particularly if the data
contain many numbers slightly different from small integers. Fortunately
almost all these spreadsheet users may employ modified software to work
around the flaw, or even disable floating-point hardware entirely, with
little perceptible performance loss.
effect of Pentium transcendental functions improvements
In contrast to the foregoing, the same executables, produced by unmodified
compilers and libraries, on 486 and unflawed Pentiums, revealed differences
in five realistic applications and five transcendental function test
programs. Conclusion: Persons attempting to verify flawed Pentium
scientific calculations on 486 systems are far more likely to discover the
Pentium transcendental improvements than the division flaw.
effect of software modifications on unflawed Pentium
Programs run on unflawed Pentium systems produce identical results whether
compiled with unmodified compilers and math library, or with compilers and
math library modified to avoid the division flaw. There is no average
performance difference. Conclusion: On unflawed Pentiums, the software
workarounds do not affect the correctness or performance of a variety of
realistic technical applications.
effect of software modifications on flawed Pentium
Comparing results of a flawed Pentium with software modifications to
those of an unflawed Pentium without software modifications revealed some
significant differences. Five realistic applications printed slightly
different numerical results. Whereas a flawed Pentium with no software
modifications failed ucbdivtest by six units in single precision after
750000 test cases, the flawed Pentium with software modifications failed
ucbdivtest by one unit in extended precision after 61 test cases. As
advertised, the software workarounds reduce the worst-case relative error
from a few parts in 10**5 to one part in 10**19. The median performance
degradation was 1%, and the first and third quartile points were at
86% and 100%, compared to 98% and 100% for the previous case. This
performance degradation was due to the extra overhead invoked by the
software modifications to check division arguments and work around
potentially hazardous situations. Conclusion: On flawed Pentiums, the
software workarounds do not affect the correctness of a variety of
realistic technical applications, and avoid any severe correctness problems
caused by the division flaw. The performance penalty due to the software
modifications on flawed Pentiums is occasionally noticeable but usually
tolerable.
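The figures "one unit in extended precision" and "one part in 10**19" are connected by the 64-bit significand of x87 extended precision. A minimal sketch of that arithmetic follows; the function name is hypothetical and the result is the relative size of one unit in the last place for a value near 1.

```python
def relative_ulp(significand_bits):
    """Relative size of one unit in the last place for a value near 1."""
    return 2.0 ** (1 - significand_bits)


SINGLE_BITS = 24     # IEEE 754 single precision
DOUBLE_BITS = 53     # IEEE 754 double precision
EXTENDED_BITS = 64   # x87 extended precision

if __name__ == "__main__":
    for name, bits in (("single", SINGLE_BITS),
                       ("double", DOUBLE_BITS),
                       ("extended", EXTENDED_BITS)):
        print(f"{name:9s} {relative_ulp(bits):.3e}")
```

For extended precision this gives about 1.1 parts in 10**19, roughly fourteen orders of magnitude smaller than the unworked-around worst case of a few parts in 10**5.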
Test Configurations
Test configurations consisted of a host PC EISA system with an Intel CPU,
a compiler and libraries, and specific compilation options. All test
configurations used Solaris 2.4 for x86 with Driver Update 3 as the operating
system, with 200MB of tmpfs /tmp+swap on Seagate ST3600N, ST3620N, or ST11200N
FAST SCSI disks. Host details are as follows:
HOSTS
Name     Vendor   CPU                      RAM
fix1     Intel    90 MHz Pentium unflawed  64MB
fix2     Intel    90 MHz Pentium unflawed  64MB
flaw1    Intel    90 MHz Pentium flawed    64MB
flaw2    Intel    90 MHz Pentium flawed    64MB
gateway  Gateway  66 MHz 486DX2            40MB
All were NFS clients of a SPARCstation 10/41 file server containing the test
programs, input data, executables, and results. Execution took place locally
in /tmp.
All the Pentium systems were identical except for the CPU. They each had
writeback cache enabled, an Adaptec 274x SCSI controller, and an SMC 8016T
ethernet controller installed. Enabling writeback cache is critical to
Pentium performance - writethrough cache produced SPEC ratios 2X lower.
COMPILERS
Name      Compiler                             Options      Math libraries
flaw8     GCC 2.6.3 with Cygnus patch aligned  -mfpflaw     workaround,fdlibm,sun
flaw8m    GCC 2.6.3 with Cygnus patch aligned  -mfpflaw     noworkaround,fdlibm,sun
noflaw8p  GCC 2.6.3 with Cygnus patch aligned  -mno-fpflaw  workaround,fdlibm,sun
noflaw8   GCC 2.6.3 with Cygnus patch aligned  -mno-fpflaw  noworkaround,fdlibm,sun
noflaw    GCC 2.6.3 with Cygnus patch          -mno-fpflaw  noworkaround,fdlibm,sun
263       GCC 2.6.3 unpatched                               fdlibm,sun
The Cygnus patch for GCC 2.6.3 converts all floating-point division
instructions to register-register form in an early stage of the compiler
and later converts those instructions into calls to a subroutine that
works around the division flaw. In some cases the extra instructions may
cause a slight performance degradation. (Intel's own compiler, not tested
in this study, avoids the first conversion and calls separate workaround
subroutines for each form of division instruction; higher levels
of optimization can inline those functions as well.) In addition,
conversion of an instruction to an external subroutine call usually
inhibits a number of optimization techniques.
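This report does not list the workaround subroutine itself. As an illustration only, the sketch below follows the scheme usually given in public descriptions of the flaw: test the divisor's leading significand bits against five patterns and, if they match, scale both operands by 15/16 before dividing. The bit patterns, the scale factor, and the function names are assumptions, not details confirmed by this report, and the code works on doubles rather than x87 extended values.

```python
import math

# Assumption: the five leading fraction-bit patterns usually cited in public
# descriptions of the flaw; this report does not list them.
RISKY_PREFIXES = {0b0001, 0b0100, 0b0111, 0b1010, 0b1101}


def divisor_at_risk(d):
    """Return True if d's significand matches an assumed at-risk pattern.

    Pattern assumed here: one of RISKY_PREFIXES in the first four fraction
    bits, followed by six 1 bits.
    """
    if d == 0.0 or math.isinf(d) or math.isnan(d):
        return False
    mantissa, _ = math.frexp(abs(d))          # mantissa in [0.5, 1)
    significand = int(mantissa * 2.0 ** 53)   # exact integer below 2**53
    fraction = significand & ((1 << 52) - 1)  # drop the leading 1 bit
    first_four = fraction >> 48
    next_six = (fraction >> 42) & 0b111111
    return first_four in RISKY_PREFIXES and next_six == 0b111111


def guarded_divide(n, d):
    """Divide n by d, rescaling both operands when d looks hazardous."""
    if divisor_at_risk(d):
        # Scaling by 15/16 is intended to move the divisor off the hazardous
        # pattern; the scaling itself can round, which is one way a
        # last-bit difference in the quotient can arise.
        n *= 15.0 / 16.0
        d *= 15.0 / 16.0
    return n / d
```

For example, divisor_at_risk(3145727.0) is True under these assumed patterns, while divisor_at_risk(3.0) is False, so most divisions would pay only for the test.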
"aligned" in the table refers to the additional modifications
described below for more predictable performance.
Math libraries were linked in the order indicated. In all cases,
Fortran programs were compiled after translation with F2C from AT&T Bell
Labs.
LIBM.A'S
workaround    Intel library with workarounds
noworkaround  Intel library with no workarounds
fdlibm        Freely-distributable library
sun           Sun ProCompiler C 2.0.1 library
When released, this version of the Intel library will be a commercial
product optimized for Pentium, available with or without workarounds for
the Pentium division flaw; it supports single, double, and extended
precision. (Intel's own compiler, not tested in this study, links with the
workaround library by default, since Pentium compilation is the default;
but if compilation is specified for a 486 or earlier, then the library
without the workarounds is used.)
fdlibm is a freely-distributable library, supporting double precision
only, made available at netlib by SunSoft Developer Products. The
Sun ProCompiler library is a commercial product, and was included at the
end of the link sequence so that certain IEEE 754 functions needed for
some of the test programs would be available in all the configurations;
it supports single, double, and extended precision, although the double
precision support was never used because the Intel library sometimes and
fdlibm always preceded it in the linking sequence.
Two levels of optimization were tested: "g" (-g) was used to determine
correct output initially, but most of the tests were run with "max"
GCC optimizations:
-O3
-m486
-finline-functions
-funroll-loops
-fomit-frame-pointer
-fwritable-strings
-static
-ffloat-store
Some of the foregoing may have no effect on Intel systems. After
considerable experimentation I determined that -ffloat-store was required
to execute many of the more sensitive test programs correctly, although
it inhibits some optimizations, and that -ffast-math caused too many
problems to be used routinely either with sensitive test programs or
realistic applications, so I discarded all results compiled with
-ffast-math or without -ffloat-store and started over. No doubt by compiling
with -ffast-math and without -ffloat-store, better performance could be
obtained on many programs.
The Reference Configuration used in correctness and performance
comparisons was the 90MHz unflawed Pentium system fix2, with the noflaw8
compiler and libraries, using maximum optimizations.
Software modifications
It was necessary to modify the Cygnus and Intel software slightly
for the purposes of this study.
Alignment: The Pentium processor requires that double-precision data
be aligned on 8-byte boundaries for optimum performance; the penalty for
misalignment can be almost 2X. However the Intel ABI does not specify
8-byte alignment for double-precision data, and GCC does not so align it.
I modified GCC to do such alignment everywhere but on the stack, where
8-byte alignment probably would require much more extensive compiler
modifications. Instead I stabilized (rather than optimized) stack
alignment by ensuring that it was consistently aligned prior to the
invocation of main(), so that minor changes in storage layout such as those
induced by different libraries would have less of an effect on performance
measurements - important since the performance changes due to the
Pentium flaw workaround software are mostly very small.
The 486 is also susceptible to performance variations due to
misalignment, but the worst case is more like 1.2X.
Initial inexact exception: IEEE arithmetic requires that all exception
flags including inexact be cleared at the start of a program. The
tests for the presence of the Pentium division flaw in GCC's crt1.s and
Intel's libm.a both executed an inexact division causing the flag to be
set, which was noticed and reported as an error by a few of the sensitive
test programs. So I modified the division flaw test functions to restore
the exception flags after performing the test.
Test Programs
For each test configuration, tests were conducted by compiling 126
separate source programs, some of which were compiled separately for
single, double, and extended precision, making a total of 168 executable
programs. Some of these had many separate input files, making 615
separate executions and output files to be compared for evaluating
correctness. Some of these outputs produced multiple timing data, so
there were 851 timing data in all. Of these, most were not included in
the performance analysis because they ran less than ten seconds and so
were not timed accurately enough under Unix. About 310 timings were
usable in most of the performance comparisons. Some of the most extreme
timing outliers were repeated to eliminate irreproducible glitches.
Although each individual timing datum is subject to considerable
variation from run to run, the overall conclusions are not likely to
change much.
Results from several kinds of test programs are reported below:
sensitive
Sensitive test programs are intended to investigate correctness of
difficult cases and boundary conditions. They depend intimately on
correct rounding and exception handling per IEEE 754, or on
close-to-correct rounding of elementary transcendental functions in libm.
They mostly execute quickly and so had little effect on performance
comparisons, but contribute many of the differences in the correctness
comparisons. Examples:
elefunt      Cody's elementary function tests
cvector      Coonen's IEEE test vectors
ucbmul       Kahan's multiplication test
ucbdiv       Kahan's division test
ucbsqr       Kahan's sqrt test
paranoia     Kahan's general arithmetic tests
liu          Liu's elementary function tests (older version)
ucbeef       Liu's elementary function tests (current version)
kcvector     Ng's elementary function tests
fkcvector    Ng's elementary function tests
ucblibtest   Ng's elementary function tests
ucbflibtest  Ng's elementary function tests
<any> SP single-precision version
<any> DP double-precision version
<any> QP extended-precision version
kernels
Performance measurement programs based on collections of short loops are
useful for detecting small-scale phenomena like cache variations, due to
slight differences in compiled code. They often exaggerate the impact
on realistic applications. Being artificial, they often lack measures
of correctness other than ad-hoc checksums applied to their computed
outputs. Examples:
sl#N    Livermore loop N, single precision
dl#N    Livermore loop N, double precision
sd#N    Digital Review loop N, single precision
dd#N    Digital Review loop N, double precision
sn#lll  NAS Kernels loop lll, single precision
dn#lll  NAS Kernels loop lll, double precision
benchmark suites
SPECfp92, SPECint92, PERFECT, and the Los Alamos benchmark suites are
collections of performance test programs intended to represent realistic
technical applications. They usually include some kind of internal test
of correctness of output to guard against flaws in hardware, optimizing
compilers, or libraries. Often the benchmark versions of source codes
and input data have been sanitized somewhat, compared to the real thing,
in order to achieve better portability among test platforms.
SPEC        0nn.*
PERFECT     adm,arc2d,bdna,dyfesm,flo52,mdg,mg3b,ocean,qcd2,spec77,track,trfd
Los Alamos  gamteb,hydro,intmc,photon,vgam
slalom      scalable benchmark code
realistic applications
Over the years SunSoft Developer Products has received a number of large
realistic applications that present interesting correctness or performance
problems, from Sun's field engineers, technical support, customers,
and even USENET postings. However most proved unexceptional in this
study. Examples:
3e2         SPICE 3E2 semiconductor device simulation
dnacompare  dna sequence comparison
geodetic8   geodetic distance in spacetime
g4          exact rational system analyzer
herwig57    hadron emission reactions
jetset74    jet fragmentation physics
launch      junction calculations by modal matching
ray         rayshade 4.06 graphics rendering program
reweight    particles in detector
Performance ratios
The following table reports ratios related to SPECfp92 and SPECint92.
These ratios are comparable within this report, but not with those measured or
reported elsewhere, because the SPEC source programs and test infrastructure
were modified for this investigation. In particular, the best SPEC ratios
reported elsewhere for Pentium systems have been obtained with compilers that
optimize more aggressively and specifically for Pentium than GCC 2.6.3 and
F2C. As with other SPEC ratios, the reference performance times come from a
VAX 11/780 rather than the noflaw8.max.fix2 reference times used elsewhere in
this report.
SPEC 92 ratios
Comp      Opt  Host     fp  int
flaw8     max  fix1     48   62
noflaw8   max  fix2     48   62
noflaw8p  max  flaw2    48   62
noflaw    max  fix2     47   62
noflaw8   max  flaw2    47   62
flaw8     max  flaw1    44   62
flaw8m    max  flaw1    44   62
263       max  fix1     43   62
noflaw8   max  gateway  15   27
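SPEC92 summary ratios are conventionally the geometric mean of per-benchmark ratios of reference (VAX 11/780) time to measured time. The sketch below assumes that convention also applies to the modified test infrastructure used here; the individual benchmark times are not given in this report, so any times passed to these functions are hypothetical. It also shows two ways to express the drop from 48 to 44 in the fp column, one of which rounds to the "about 9%" quoted in the abstract.

```python
import math


def spec_ratio(reference_seconds, measured_seconds):
    """Per-benchmark ratio: reference machine time over measured time."""
    return reference_seconds / measured_seconds


def geometric_mean(values):
    """Geometric mean of positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def percent_loss(baseline, degraded):
    """Loss expressed as a percentage of the baseline ratio."""
    return 100.0 * (baseline - degraded) / baseline


def percent_faster(baseline, degraded):
    """How much larger the baseline ratio is, as a percentage of the degraded one."""
    return 100.0 * (baseline / degraded - 1.0)


if __name__ == "__main__":
    # fp ratios from the table: noflaw8.max.fix2 = 48, flaw8.max.flaw1 = 44
    print(f"loss relative to 48:   {percent_loss(48, 44):.1f}%")
    print(f"gain relative to 44:   {percent_faster(48, 44):.1f}%")
```

Because the tabulated ratios are rounded to whole numbers, either percentage is only approximate.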
Relative Performance Graphs
Relative performance is defined as the ratio of execution times,
expressed as a percentage. Thus if a particular test required 120 seconds in
the flaw8.max.flaw1 configuration and 80 seconds in the reference
configuration, the relative performance percentage would be 80/120 = 67%
because only 2/3 of the reference performance was obtained.
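The definition can be written as a one-line function. The sketch below also includes the ten-second cutoff described under Test Programs; the function names are hypothetical.

```python
MIN_USABLE_SECONDS = 10.0  # shorter runs were not timed accurately enough


def relative_performance(reference_seconds, test_seconds):
    """Reference time over test time, as a percentage."""
    return 100.0 * reference_seconds / test_seconds


def usable(seconds, minimum=MIN_USABLE_SECONDS):
    """True if a timing is long enough to be included in the analysis."""
    return seconds >= minimum


if __name__ == "__main__":
    # The worked example from the text: 120 s under test, 80 s for reference.
    print(f"{relative_performance(80.0, 120.0):.0f}%")
```

A value below 100% means the tested configuration was slower than the reference; a value above 100% means it was faster.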
After all the relative performance percentages were computed for a
particular configuration, they were sorted and the 0%, 25%, 50%, 75%, and 100%
quartile levels were determined and plotted in the graphs below. A typical
line
scale (%): 25 33 44 58 76 90 100 112 132 175 230 300 400
bar                      #    comp    opt  host
6==31===-----------74    313  nofl8   max  gate
indicates the following information for the noflaw8.max.gateway configuration,
based upon 313 timing data: the worst relative performance percentage was 6%,
the median was 31%, and the best was 74%. The median performance reflects the
relative difference between the 66 MHz 486DX2 Gateway system and the reference
system. The ==31=== double bars indicate the region between the 25% and 75%
quartiles, which contains half the data; those relative performance points may
be read from the scale as approximately 27% and 37%. The scale is logarithmic
so that relative performances of 400% and 25% are equally distant from 100%.
The performance spread is so slight in most of the comparison configurations
that the 25%, 50%, and 75% quartiles are superimposed.
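A minimal sketch of the five-level summary and of the logarithmic scale follows. The report does not say which quartile interpolation rule was used, so the "inclusive" method of Python's statistics.quantiles is an assumption, and the function names are hypothetical.

```python
import math
import statistics


def five_levels(percentages):
    """Return the 0%, 25%, 50%, 75%, and 100% levels of the data."""
    q1, q2, q3 = statistics.quantiles(percentages, n=4, method="inclusive")
    return (min(percentages), q1, q2, q3, max(percentages))


def log_distance_from_reference(percent):
    """Distance from 100% on a logarithmic scale."""
    return abs(math.log(percent / 100.0))


if __name__ == "__main__":
    print(five_levels([6, 27, 31, 36, 74]))
    print(log_distance_from_reference(400.0),
          log_distance_from_reference(25.0))
```

On this scale a test that ran four times faster and one that ran four times slower sit at equal distances on either side of 100%.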
Relative run performance
scale (%): 25 33 44 58 76 90 100 112 132 175 230 300 400
bar                      #    comp    opt  host
79--100108               312  nofl8   max  fl2
77---100116              312  nofl8p  max  fl2
68-----100112            314  fl8     max  fx1
59---===97=-126          313  263     max  fx1
59------=99-122          313  nofl    max  fx2
58-----==99108           312  fl8     max  fl1
58------=99110           313  fl8m    max  fl1
6==31===-----------74    313  nofl8   max  gate
The following provides the same information in numerical form:
  0%  25%  50%  75%  100%  tests  comp.opt.host
 100  100  100  100   100    314  noflaw8.max.fix2
  79  100  100  100   108    312  noflaw8.max.flaw2
  77   99  100  100   116    312  noflaw8p.max.flaw2
  68   98  100  100   112    314  flaw8.max.fix1
  59   78   97  100   126    313  263.max.fix1
  59   91   99  100   122    313  noflaw.max.fix2
  58   86   99  100   108    312  flaw8.max.flaw1
  58   89   99  100   110    313  flaw8m.max.flaw1
   6   27   31   36    74    313  noflaw8.max.gateway
Relative Performance Extremes
The following tables list the tests displaying the eight most extreme
relative performance percentages for each comparison configuration. The 100%
reference performance configuration is noflaw8.max.fix2.
Relative performance extremes
  %  test          %  test          %  test          %  test         comp    opt  host
 59  dl#1         59  dl#1x1001x8  60  dl#19        60  dl#12        263     max  fx1
106  ducbmul     117  liu SP      118  094.fpppp   126  ducbdiv      263     max  fx1
 68  qucbsqr      70  fpppp        70  ducbsqr      86  ducbeef      fl8     max  fx1
105  sl#11       107  g4.m35      108  g4.m33      112  015.doduc    fl8     max  fx1
 59  dl#1         60  fpppp        60  dl#19        60  dl#1x1001x8  nofl    max  fx2
109  015.doduc   111  liu SP      117  dn#emit     122  094.fpppp    nofl    max  fx2
 58  dd#AIRREL    58  sd#AIRREL    63  dd#EGYPT     63  sd#EGYPT     fl8     max  fl1
107  g4.m34      107  g4.m36      108  g4.m22      108  g4.m35       fl8     max  fl1
 58  sd#AIRREL    58  dd#AIRREL    63  sd#EGYPT     63  dd#EGYPT     fl8m    max  fl1
104  sl#11       104  dl#16       106  dn#cfft2d   110  g4.m33       fl8m    max  fl1
 79  dd#ALAM18    92  052.alvinn   94  dr3 DP       96  dn#cfft2d    nofl8   max  fl2
104  spiff       104  094.fpppp   106  g4.m36      108  g4.m22       nofl8   max  fl2
 77  cslalom      77  dd#ALAM18    83  fpppp        86  zhuge        nofl8p  max  fl2
104  sl#11       104  dl#16       108  g4.m35      116  015.doduc    nofl8p  max  fl2
  6  intmc1000    15  fpppp        16  dl#10        18  dl#10        nofl8   max  gate
 53  023.eqntott  54  dd#PRIME     56  dn#gmtry     74  dn#vpenta    nofl8   max  gate
Faster than 100%?
For several Pentium configurations, there were a few tests that ran
faster than the reference configuration; some of these tests even overcame the
burden of division flaw workaround code. Since relative performance
differences of less than 5% are usually not significant, the interesting tests
attain relative performance in excess of 105%.
There appear to be two causes for such high performance results:
alignment and cache effects. Recall that the "aligned" modifications to GCC
did not optimize the alignment of variables allocated on the stack - the
modifications merely served to stabilize the alignment among the various
*flaw8 configurations. So it is not surprising that by chance some programs
happened to achieve better alignment with the "263" and "noflaw" compilations
while others achieved worse alignment, compared to the stabilized alignment of
the *flaw8 compilations.
Internal and external cache effects may also have a role. Occasional
bad luck will cause cache thrashing with code that might otherwise run faster.
This familiar RISC phenomenon is likely to affect Pentium too.
Differences Files
The following files list the output differences between various
configurations and the reference configuration, noflaw8.max.fix2. They are
not expected to be self-explanatory in detail, but support the previous
general statements.
Comments on specific difference lists
flaw8.max.fix1: There are no differences between fixed Pentium outputs,
with or without workarounds.
noflaw.max.fix2: Two programs show differences due to differences in
alignment. liu SP, a transcendental function test program, demonstrates a bug
in the Intel log1pf function that causes it to access uninitialized storage;
differences in alignment cause different storage to be accessed.
spice3e2.ltra 3 probably demonstrates a bug in the spice3e2 circuit simulator
related to uninitialized storage too. ucblibtest.lgamma demonstrates a bug in
the Intel libm that sometimes causes lgamma(nan) to dump core.
noflaw8.max.flaw2: These differences arise due to the Pentium flaw.
liu DP demonstrates an atan flaw. sucbdiv and ucbpla are sensitive division
test programs designed to test rounding in nearly half-way cases, and to test
SRT division, respectively.
noflaw8p.max.flaw2: These differences arise due to the Pentium flaw in
the test programs and to differences in libm workarounds. Libm workarounds
produce numerical differences in the slalom and savage benchmarks and elefunt
test. The workaround libm introduces gratuitous underflow exceptions in
atan2, which are reported by ucblibtest.
flaw8m.max.flaw1: These differences arise due to Pentium flaws in libm,
and to the division workaround in the test programs. The doduc and hydro2d
SPEC benchmarks show minor numerical differences, along with the EGYPT kernel
from the Digital Review suite, launch-junction, and rollbar. The division
workaround causes ucbdivtest to fail in extended precision by one unit in the
least significant bit rather than in single precision by ten units.
ucbpitest QP, a trigonometric function test program, also shows very small
differences due to the division workaround.
flaw8.max.flaw1: These differences remain after all workarounds are
applied. They are just those that arose under noflaw8p and flaw8m above.
noflaw8.max.gateway: More extensive than any of the foregoing are the
differences between the transcendental function implementations in 486 and
Pentium. These affect all the application programs that noticed differences
between flawed and unflawed Pentium. The ucbeeftest QP results indicate the
benefit of the improved transcendentals in extended precision: Worst atan
error reduced from 1.3 units to 0.7 units; worst log error reduced from 1.1 to
1.0 units. The ucbflibtest.log2 results indicate an unexpected difference:
the fyl2x instruction is exact for exact arguments (like log2(2)) on 486, but
sets the inexact flag gratuitously on Pentium even though the correct exact
result is computed.
Acknowledgements
Intel lent the four Pentium test systems fix1, fix2, flaw1, and flaw2,
and provided copies of its libm.a with and without Pentium flaw workarounds.
Cygnus Software provided the patched version of GCC 2.6.3 supporting
-mno-fpflaw and -mfpflaw at its public ftp site.
SunSoft Developer Products provided some of its compiler/hardware
validation software, the Solaris 2.4 operating system with driver update 3,
and the ProCompiler 2.0.1 C and libm.a, and made fdlibm available at
netlib@research.att.com.
AT&T Bell Labs made the F2C Fortran-to-C translator available at
netlib@research.att.com.
Advertisement
Studies like this one can be performed to evaluate comparative correctness
and performance of computer system hardware, operating systems, compilers
and libraries. Please request a business announcement from dgh@validgh.com.
tbl | troff -ms source for this report is available from
dgh@validgh.com, along with source for other reports listed in the business
announcement.