More results and comments about gradual underflow
Nelson H. F. Beebe
beebe@math.utah.edu
Mon Dec 6 13:55:45 PST 1999
I've just run Vaughan Pratt's <pratt@cs.stanford.edu> gradual
underflow test program on several local architectures that reflect a
broader view of the desktop hardware of the 1990s. The results are
tabulated below.
Points to note are:
    (1) HP PA-RISC results depend drastically on optimization;
    (2) IBM PowerPC and Intel Pentium II and III handle
        denormalized values quickly in hardware;
    (3) Motorola 68040 handles denormalized values in software;
    (4) DEC Alpha (DECchip 21040-AA) handles denormalized values
        in software, but only if you ask for it (with the -ieee flag
        in C, or -fpe3 or -fpe4 in Fortran); otherwise, you get
        flush-to-zero.  The DEC Alpha 21264 chips [see
        R. E. Kessler, ``The Alpha 21264 Microprocessor'', IEEE
        Micro, 19(2), 24--36, March/April 1999 for details] have
        hardware handling of exceptional floating-point values,
        but I don't have access to a system with that chip level
        to make measurements;
    (5) Later MIPS chips handle denormalized values in hardware,
        but earlier chips do it in software.
I wonder what will happen in HP/Intel Merced (aka IA64)?
The moral for benchmarkers is: you need to try multiple optimization
levels, and ideally, remove all gradual underflows (and NaNs and
Infinities) from your benchmark program; otherwise, your results may
be seriously skewed by software handling of denormalized values.
==========================================================================================
Vendor/Model      O/S              user    system  elapsed     CPU  (compiler and flags)
==========================================================================================
xxx MHz Apple     Rhapsody 5.5   0.180u    0.010s  0:00.18  105.5%  (cc -g)
PowerMac G3                      0.090u    0.010s  0:00.08  125.0%  (cc -O1)
                                 0.080u    0.020s  0:00.08  125.0%  (cc -O2)
                                 0.080u    0.020s  0:00.08  125.0%  (cc -O3)
                                 0.080u    0.020s  0:00.08  125.0%  (cc -O4)
# NB: Output is: 1023 2.22507e-308 (i.e., flush-to-zero without gradual underflow):
466MHz DEC Alpha  OSF/1 4.0g     0.029u    0.006s  0:00.04   50.0%  (c89 -g)
                                 0.020u    0.005s  0:00.07   28.5%  (c89 -O1)
                                 0.023u    0.005s  0:00.03   66.6%  (c89 -O2)
                                 0.021u    0.006s  0:00.03   66.6%  (c89 -O3)
                                 0.021u    0.009s  0:00.03   66.6%  (c89 -O4)
# NB: Output is: 1075 4.94066e-324:
466MHz DEC Alpha  OSF/1 4.0g     0.155u   27.103s  0:27.30   99.8%  (c89 -ieee -g)
                                 0.235u   26.904s  0:27.15   99.9%  (c89 -ieee -O1)
                                 0.254u   26.896s  0:27.23   99.6%  (c89 -ieee -O2)
                                 0.332u   26.865s  0:27.22   99.8%  (c89 -ieee -O3)
                                 0.387u   27.050s  0:27.45   99.9%  (c89 -ieee -O4)
# NB: For +O3 and +O4, the compiler optimized away the final loop:
99 MHz            HP-UX 10.01    12.23u     0.03s  0:12.32   99.5%  (c89 -g)
HP-9000/735                      12.01u     0.03s  0:12.07   99.7%  (c89 -O)
                                 12.17u     0.03s  0:12.25   99.5%  (c89 +O1)
                                 12.01u     0.03s  0:12.09   99.5%  (c89 +O2)
                                  0.01u     0.03s  0:00.04  100.0%  (c89 +O3)
                                  0.01u     0.03s  0:00.04  100.0%  (c89 +O4)
# NB: Output correct for -g, but get "16446 0" for all -On
600 MHz Intel     GNU/Linux      1.650u    0.000s  0:01.65  100.0%  (gcc -g)
Pentium III       2.2.12-20smp   0.000u    0.000s  0:00.00    0.0%  (gcc -O1)
                  (Redhat 6.1)   1.190u    0.000s  0:01.19  100.0%  (gcc -O2)
                                 1.190u    0.000s  0:01.19  100.0%  (gcc -O3)
                                 1.190u    0.000s  0:01.19  100.0%  (gcc -O4)
# NB: Output correct for -g, but get "16446 0" for all -On.  Here, cc
# == egcs-2.91.66; tests with gcc 2.95.2 showed that it ignored the
# -ffloat-store option.
600 MHz Intel     GNU/Linux      1.650u    0.000s  0:01.65  100.0%  (cc -ffloat-store -g)
Pentium III       2.2.12-20smp   1.680u    0.000s  0:01.68  100.0%  (cc -ffloat-store -O1)
                  (Redhat 6.1)   0.830u    0.000s  0:00.83  100.0%  (cc -ffloat-store -O2)
                                 0.820u    0.010s  0:00.83  100.0%  (cc -ffloat-store -O3)
                                 0.830u    0.000s  0:00.83  100.0%  (cc -ffloat-store -O4)
# NB: cc == egcs-2.91.66; tests with gcc 2.95.2 showed that it ignored the
# -ffloat-store option.
300 MHz Intel     GNU/Linux      3.550u    0.020s  0:03.67   97.2%  (cc -ffloat-store -g)
Pentium II MMX    2.2.5-22       3.560u    0.030s  0:03.81   94.2%  (cc -ffloat-store -O1)
                  (Redhat 6.0)   1.670u    0.050s  0:01.81   95.0%  (cc -ffloat-store -O2)
                                 1.730u    0.030s  0:01.82   96.7%  (cc -ffloat-store -O3)
                                 1.710u    0.000s  0:01.75   97.7%  (cc -ffloat-store -O4)
                                 1.650u    0.010s  0:01.87   88.7%  (cc -ffloat-store -O5)
xxx MHz IBM       AIX 4.2        0.320u    0.020s  0:00.37   91.8%  (c89 -g)
RS/6000 43P                      0.170u    0.010s  0:00.17  105.8%  (c89 -O1)
                                 0.170u    0.010s  0:00.17  105.8%  (c89 -O2)
                                 0.150u    0.020s  0:00.16  106.2%  (c89 -O3)
33MHz Motorola    NeXT Mach 3.3  1.093u  271.342s  5:08.20   88.3%  (gcc -g)
68040                            0.952u  128.427s  2:11.70   98.2%  (gcc -O1)
                                 1.265u  127.940s  2:11.80   98.0%  (gcc -O2)
                                 0.843u  128.065s  2:18.11   93.3%  (gcc -O3)
                                 1.078u  128.140s  2:11.82   98.0%  (gcc -O4)
150 MHz SGI       IRIX 5.3       8.762u   20.656s  0:29.48   99.7%  (cc -ansi -g)
Challenge L                      8.818u   14.902s  0:23.26  101.9%  (cc -ansi -O1)
MIPS R4400                       5.512u   12.547s  0:17.55  102.8%  (cc -ansi -O2)
                                 5.516u   12.564s  0:17.70  102.0%  (cc -ansi -O3)
# NB: For -O2 and -O3, the compiler optimized away the final loop:
180 MHz SGI       IRIX 6.5       0.115u    0.006s  0:00.12   91.6%  (c89 -g)
Origin 200                       0.126u    0.006s  0:00.12  100.0%  (c89 -O1)
MIPS R10000                      0.003u    0.006s  0:00.00    0.0%  (c89 -O2)
                                 0.003u    0.006s  0:00.00    0.0%  (c89 -O3)
400 MHz Sun       Solaris 2.7     2.23u    11.24s  0:13.55   99.4%  (c89 -g)
UltraSPARC                        1.95u    11.27s  0:13.28   99.5%  (c89 -O1)
Enterprise 5500                   2.08u    11.45s  0:13.53  100.0%  (c89 -O2)
                                  1.99u    11.30s  0:13.31   99.8%  (c89 -O3)
                                  2.22u    11.09s  0:13.33   99.8%  (c89 -O4)
                                  1.96u    11.36s  0:13.34   99.8%  (c89 -O5)
==========================================================================================
-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- Center for Scientific Computing       FAX: +1 801 585 1640, +1 801 581 4148 -
- University of Utah                    Internet e-mail: beebe@math.utah.edu  -
- Department of Mathematics, 322 INSCC                   beebe@acm.org        -
- 155 S 1400 E RM 233                                    beebe@ieee.org       -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe  -
-------------------------------------------------------------------------------