More results and comments about gradual underflow

Mon Dec 6 13:55:45 PST 1999

I've just run Vaughan Pratt's <pratt@cs.stanford.edu> gradual
underflow test program on several local architectures that reflect a
broader view of the desktop hardware of the 1990s. The results are
tabulated below.

Points to note are:

	(1) HP PA-RISC results depend drastically on optimization;

	(2) IBM PowerPC and Intel Pentium II and III handle
	    denormalized values quickly in hardware;

	(3) Motorola 68040 handles denormalized values in software;

	(4) DEC Alpha (DECchip 21040-AA) handles denormalized values
	    in software, but only if you ask for it (with the -ieee flag
	    in C, or -fpe3 or -fpe4 in Fortran); otherwise, you get
	    flush-to-zero.  The DEC Alpha 21264 chips [see
	    R. E. Kessler, ``The Alpha 21264 Microprocessor'', IEEE
	    Micro, 19(2), 24--36, March/April 1999 for details] have
	    hardware handling of exceptional floating-point values,
	    but I don't have access to a system with that chip level
	    to make measurements;

	(5) Later MIPS chips handle denormalized values in hardware,
	    but earlier chips do it in software.

I wonder what will happen in HP/Intel Merced (aka IA64)?

The moral for benchmarkers is: you need to try multiple optimization
levels, and ideally, remove all gradual underflows (and NaNs and
Infinities) from your benchmark program; otherwise, your results may
be seriously skewed by software handling of denormalized values.
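
One way to act on that advice is to scan a benchmark's input or
intermediate data for exceptional values before timing anything. The
following is a minimal sketch in Python, not part of Pratt's test
program; the names classify and count_exceptional are hypothetical,
and it assumes Python floats are IEEE 754 doubles, as they are on
common platforms.

```python
import math
import sys


def classify(x):
    """Return 'nan', 'inf', 'zero', 'subnormal', or 'normal' for a float."""
    if math.isnan(x):
        return "nan"
    if math.isinf(x):
        return "inf"
    if x == 0.0:
        return "zero"
    # sys.float_info.min is the smallest positive *normal* double (2**-1022);
    # any nonzero magnitude below it is a denormalized (subnormal) value.
    if abs(x) < sys.float_info.min:
        return "subnormal"
    return "normal"


def count_exceptional(values):
    """Count the values that may trigger slow software handling."""
    counts = {"nan": 0, "inf": 0, "subnormal": 0}
    for v in values:
        kind = classify(v)
        if kind in counts:
            counts[kind] += 1
    return counts


if __name__ == "__main__":
    sample = [1.0, 1e-310, float("inf"), 0.0]
    print(count_exceptional(sample))
```

If every count is zero for representative data, the timings are less
likely to be dominated by trap handling on the architectures that
process such values in software.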

================================================================================
Vendor/Model     O/S               user   system elapsed CPU     (compiler command)
================================================================================
xxx MHz Apple    Rhapsody 5.5      0.180u 0.010s 0:00.18 105.5% (cc -g)
PowerMac G3                        0.090u 0.010s 0:00.08 125.0% (cc -O1)
				   0.080u 0.020s 0:00.08 125.0% (cc -O2)
				   0.080u 0.020s 0:00.08 125.0% (cc -O3)
				   0.080u 0.020s 0:00.08 125.0% (cc -O4)

# NB: Output is: 1023 2.22507e-308 (i.e., flush-to-zero without gradual underflow):
466MHz DEC Alpha OSF/1 4.0g        0.029u 0.006s 0:00.04 50.0% (c89 -g)
                                   0.020u 0.005s 0:00.07 28.5% (c89 -O1)
                                   0.023u 0.005s 0:00.03 66.6% (c89 -O2)
                                   0.021u 0.006s 0:00.03 66.6% (c89 -O3)
                                   0.021u 0.009s 0:00.03 66.6% (c89 -O4)

# NB: Output is: 1075 4.94066e-324:
466MHz DEC Alpha OSF/1 4.0g        0.155u 27.103s 0:27.30 99.8% (c89 -ieee -g)
				   0.235u 26.904s 0:27.15 99.9% (c89 -ieee -O1)
				   0.254u 26.896s 0:27.23 99.6% (c89 -ieee -O2)
				   0.332u 26.865s 0:27.22 99.8% (c89 -ieee -O3)
				   0.387u 27.050s 0:27.45 99.9% (c89 -ieee -O4)

# NB: For +O3 and +O4, the compiler optimized away the final loop:
99 MHz           HP-UX 10.01       12.23u 0.03s 0:12.32  99.5% (c89 -g)
HP-9000/735			   12.01u 0.03s 0:12.07  99.7% (c89 -O)
				   12.17u 0.03s 0:12.25  99.5% (c89 +O1)
				   12.01u 0.03s 0:12.09  99.5% (c89 +O2)
				   0.01u  0.03s 0:00.04 100.0% (c89 +O3)
				   0.01u  0.03s 0:00.04 100.0% (c89 +O4)

# NB: Output correct for -g, but get "16446 0" for all -On
600 MHz Intel    GNU/Linux         1.650u 0.000s 0:01.65 100.0% (gcc -g)
Pentium III      2.2.12-20smp      0.000u 0.000s 0:00.00   0.0% (gcc -O1)
                 (Redhat 6.1)      1.190u 0.000s 0:01.19 100.0% (gcc -O2)
		                   1.190u 0.000s 0:01.19 100.0% (gcc -O3)
		                   1.190u 0.000s 0:01.19 100.0% (gcc -O4)

# NB: Output correct for -g, but get "16446 0" for all -On.  Here, cc
# == egcs-2.91.66; tests with gcc 2.95.2 showed that it ignored the
# -ffloat-store option.
600 MHz Intel    GNU/Linux         1.650u 0.000s 0:01.65 100.0% (cc -ffloat-store -g)
Pentium III      2.2.12-20smp      1.680u 0.000s 0:01.68 100.0% (cc -ffloat-store -O1)
                 (Redhat 6.1)      0.830u 0.000s 0:00.83 100.0% (cc -ffloat-store -O2)
				   0.820u 0.010s 0:00.83 100.0% (cc -ffloat-store -O3)
				   0.830u 0.000s 0:00.83 100.0% (cc -ffloat-store -O4)

# NB: cc == egcs-2.91.66; tests with gcc 2.95.2 showed that it ignored the
# -ffloat-store option.
300 MHz Intel    GNU/Linux         3.550u 0.020s 0:03.67 97.2% (cc -ffloat-store -g)
Pentium II MMX   2.2.5-22          3.560u 0.030s 0:03.81 94.2% (cc -ffloat-store -O1)
                 (Redhat 6.0)      1.670u 0.050s 0:01.81 95.0% (cc -ffloat-store -O2)
		                   1.730u 0.030s 0:01.82 96.7% (cc -ffloat-store -O3)
				   1.710u 0.000s 0:01.75 97.7% (cc -ffloat-store -O4)
				   1.650u 0.010s 0:01.87 88.7% (cc -ffloat-store -O5)

xxx MHz IBM      AIX 4.2           0.320u 0.020s 0:00.37 91.8%  (c89 -g)
RS/6000 43P		           0.170u 0.010s 0:00.17 105.8% (c89 -O1)
                                   0.170u 0.010s 0:00.17 105.8% (c89 -O2)
                                   0.150u 0.020s 0:00.16 106.2% (c89 -O3)

33MHz Motorola   NeXT Mach 3.3     1.093u 271.342s 5:08.20 88.3% (gcc -g)
68040			           0.952u 128.427s 2:11.70 98.2% (gcc -O1)
				   1.265u 127.940s 2:11.80 98.0% (gcc -O2)
				   0.843u 128.065s 2:18.11 93.3% (gcc -O3)
				   1.078u 128.140s 2:11.82 98.0% (gcc -O4)

150 MHz SGI      IRIX 5.3          8.762u 20.656s 0:29.48  99.7% (cc -ansi -g)
Challenge L                        8.818u 14.902s 0:23.26 101.9% (cc -ansi -O1)
MIPS R4400			   5.512u 12.547s 0:17.55 102.8% (cc -ansi -O2)
				   5.516u 12.564s 0:17.70 102.0% (cc -ansi -O3)

# NB: For -O2 and -O3, the compiler optimized away the final loop:
180 MHz SGI      IRIX 6.5          0.115u 0.006s 0:00.12  91.6%   (c89 -g)
Origin 200                         0.126u 0.006s 0:00.12 100.0%   (c89 -O1)
MIPS R10000			   0.003u 0.006s 0:00.00   0.0%   (c89 -O2)
				   0.003u 0.006s 0:00.00   0.0%   (c89 -O3)

400 MHz Sun      Solaris 2.7       2.23u 11.24s 0:13.55  99.4% (c89 -g)
UltraSPARC                         1.95u 11.27s 0:13.28  99.5% (c89 -O1)
Enterprise 5500                    2.08u 11.45s 0:13.53 100.0% (c89 -O2)
                                   1.99u 11.30s 0:13.31  99.8% (c89 -O3)
				   2.22u 11.09s 0:13.33  99.8% (c89 -O4)
				   1.96u 11.36s 0:13.34  99.8% (c89 -O5)
================================================================================
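
The outputs quoted in the NB lines above are what you would expect
from repeatedly halving 1.0 and counting the nonzero values seen
before the result reaches zero. The sketch below is in Python and is
an assumption about the general shape of such a test, not Pratt's
actual program; count_halvings is a hypothetical name, and the
flush_to_zero argument only emulates flush-to-zero in software. It
assumes Python floats are IEEE 754 doubles.

```python
import sys


def count_halvings(flush_to_zero=False):
    """Halve x from 1.0 until it becomes zero.

    Returns (count, last), where count is the number of nonzero values
    seen and last is the final nonzero value.
    """
    x = 1.0
    count = 0
    last = 0.0
    while x != 0.0:
        count += 1
        last = x
        x = x / 2.0
        # Emulate hardware that flushes results below the smallest
        # normal double (2**-1022) to zero instead of denormalizing.
        if flush_to_zero and x < sys.float_info.min:
            x = 0.0
    return count, last


if __name__ == "__main__":
    print("%d %g" % count_halvings())                    # gradual underflow
    print("%d %g" % count_halvings(flush_to_zero=True))  # flush-to-zero
```

With gradual underflow the values run from 2**0 down to 2**-1074,
which is 1075 values ending at 4.94066e-324; with flush-to-zero they
stop at 2**-1022, giving 1023 values ending at 2.22507e-308. The
"16446 0" output would be consistent with the loop variable living in
an 80-bit x87 register, whose smallest denormalized value is
2**-16445, although that is my reading of the numbers rather than
something I have verified in the generated code.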

-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- Center for Scientific Computing       FAX: +1 801 585 1640, +1 801 581 4148 -
- University of Utah                    Internet e-mail: beebe@math.utah.edu  -
- Department of Mathematics, 322 INSCC                   beebe@acm.org        -
- 155 S 1400 E RM 233                                    beebe@ieee.org       -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe  -
-------------------------------------------------------------------------------