The cost of underflows: some empirical data

Mon Aug 30 15:24:49 PDT 1993

Two weeks ago, I sent a note (reproduced below) to David Hough for
comment, and at his suggestion, am reposting it to the nceg list.

------------------------------------------------------------------------
Date: Tue, 17 Aug 93 18:42:50 MDT
From: "Nelson H. F. Beebe" <beebe@math.utah.edu>
Subject: Performance impact of underflow on IEEE 754 systems

A problem report from one of our large computational users this
morning led me to launch a small investigation into the performance
impact of underflow on several local systems of interest.  The test
code looks something like this (single- and double-precision versions
in C and Fortran are available on request; the timing function,
second(), must be provided externally; I used the version in the
PLOT79 libraries available on most of these systems):

      integer N
      parameter (N = 1000000)
      double precision x,y,z
      real second,t1,t2,dummy

*     Case 1: 1/1 is exact, so no underflow occurs.
      t1 = second(dummy)
      y = 1.0d0
      z = 1.0d0
      do 10 k = 1,N
          x = y/z
          call foo(x,y,z)
   10 continue
      t1 = second(dummy) - t1
      write (6,*) 'Time (no underflow): ',t1,' sec'

*     Case 2: 1.0d-300/1.0d+300 underflows on every division.
      t2 = second(dummy)
      y = 1.0d-300
      z = 1.0d+300
      do 20 k = 1,N
          x = y/z
          call foo(x,y,z)
   20 continue
      t2 = second(dummy) - t2
      write (6,*) 'Time (underflow): ',t2,' sec'
      end

*     Empty subroutine: the external call keeps an optimizer from
*     deleting the otherwise unused quotient x.
      subroutine foo(a,b,c)
      double precision a,b,c
      end
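
A rough C analogue of the Fortran test above can be sketched as
follows.  This is not the C version mentioned in the note, only a
minimal illustration under stated assumptions: the standard clock()
function stands in for the external second() timer, volatile variables
take the place of the call to foo(), and the names time_divisions()
and report_underflow_cost() are hypothetical.

```c
#include <stdio.h>
#include <time.h>

/* Sink for the quotient.  The volatile store plays the role of the
   empty Fortran subroutine foo(): it keeps the compiler from
   discarding the division.  (Hypothetical name.) */
static volatile double sink;

/* Time n divisions y0/z0 and return the elapsed CPU time in seconds.
   Assumption: clock() replaces the external second() timer. */
double time_divisions(double y0, double z0, long n)
{
    volatile double y = y0, z = z0;   /* force a real divide on each pass */
    clock_t start = clock();

    for (long k = 0; k < n; k++)
        sink = y / z;

    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Run both cases and print the two times and, when measurable, their ratio. */
void report_underflow_cost(long n)
{
    double t1 = time_divisions(1.0, 1.0, n);            /* no underflow */
    double t2 = time_divisions(1.0e-300, 1.0e+300, n);  /* underflows   */

    printf("Time (no underflow): %g sec\n", t1);
    printf("Time (underflow):    %g sec\n", t2);
    if (t1 > 0.0)
        printf("Ratio:               %.1f\n", t2 / t1);
}
```

Calling report_underflow_cost(1000000) from main() mirrors the loop
count N used in the Fortran code.  On fast hardware the count may need
to be raised before clock() can resolve the elapsed time.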

The results are summarized below, but I'm curious about your
experience in this area.  I was surprised by the differences
between SunOS 4.1.3 and Solaris 2.2 (or do they arise from ELC vs.
10/41 floating-point hardware differences?).  I had mistakenly assumed
that most IEEE 754 implementations would not require software
involvement for underflows, so that there would be no performance
penalty.  The results indicate that this is clearly not the case.

%% /u/sy/beebe/courses/ufl/ufldp.readme, Tue Aug 17 18:08:16 1993
%% Edit by Nelson H. F. Beebe <beebe@honeycomb.math.utah.edu>

==============================================================================
Double-precision underflow benchmark results (times in CPU seconds).

Host                    Architecture and O/S    Time (no ufl) Time (ufl) Ratio

jeeves.math.utah.edu    DECstation 5000 Ultrix 4.2      0.15      0.16    1.0
adam.math.utah.edu      HP 9000/735 HP-UX 9.0           0.58     21.19   36.5
ee.utah.edu             HP 9000/835 HP-UX 9.0           5.26    155.34   28.5
snow.usi.utah.edu       IBM PS/2 model 70 AIX 2.1     648.5     741.8     1.1
osiris.usi.utah.edu     IBM RS/6000-550 AIX 3.2.3       0.51      0.66    1.3
avalanche.usi.utah.edu  IBM 3090/600S-VF AIX 2.1        1.66    295.0   177.7
jabberwocky.math.utah.edu NeXT Mach 3.0                 2.52     28.47   11.3
eros.math.utah.edu      SGI Indigo R3000 IRIX 4.0.5     1.14     70.43   61.8
honeycomb.math.utah.edu SGI Indigo R4000 IRIX 4.0.5     0.64     28.22   44.0
graphics.math.utah.edu  Stardent 1520 OS 2.2            4.25      4.26    1.0
magna.math.utah.edu     Sun ELC SunOS 4.1.3             2.57    504.13  196.2
magna.math.utah.edu     Sun ELC SunOS 4.1.3 (nonstd)    2.80      2.93    1.0
plot79.math.utah.edu    Sun SS10/41 Solaris 2.2         0.87      0.89    1.0
==============================================================================

Ratios in the last column show the penalty of having underflows; i.e.
values > 1.0 mean that underflows slow down the computation.

With the exception of the IBM 3090/600S-VF, all of these systems have
IEEE 754 arithmetic in hardware, which was designed to support
non-stop computation in the presence of underflow, overflow, infinity,
and NaN, so it is rather surprising that so many of them carry a
serious execution-time penalty for underflows.  The little benchmark
program was prepared to investigate a report by a user who found that
a certain iterative process in a large fluid-flow problem would take
widely differing times for the iterations, when similar times were
expected.  This happened on an SGI Indigo R4000 system, and when the
computation was subsequently adjusted to avoid underflow, the code ran
about 10 times faster.
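
To see which kind of underflow a given division produces, the quotient
can be inspected directly.  The sketch below is an illustration only;
the function name quotient_underflow_kind() is hypothetical, and it
assumes IEEE 754 double precision with finite operands and a nonzero
divisor.

```c
#include <float.h>

/* Classify the quotient y0/z0 by inspecting the result only.
   Assumes finite operands and a nonzero divisor.
   Returns 0 for a normal quotient (or an exact zero from y0 == 0),
           1 for a nonzero subnormal quotient (gradual underflow),
           2 for a quotient that underflowed all the way to zero. */
int quotient_underflow_kind(double y0, double z0)
{
    volatile double y = y0, z = z0;   /* block compile-time evaluation */
    volatile double q = y / z;
    double x = q;
    double ax = (x < 0.0) ? -x : x;   /* magnitude, without calling fabs() */

    if (x == 0.0)
        return (y0 != 0.0) ? 2 : 0;   /* nonzero numerator, zero quotient */
    if (ax < DBL_MIN)
        return 1;                     /* below the smallest normal double */
    return 0;
}
```

For the benchmark operands, quotient_underflow_kind(1.0e-300, 1.0e+300)
returns 2: the exact quotient, 1.0e-600, lies far below even the
smallest subnormal double, so the stored result is zero.  A quotient
such as 1.0e-300/1.0e+10 instead yields a nonzero subnormal and
returns 1, provided gradual underflow is in effect.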

Examination of the MIPS RISC Architecture book (G. Kane & J.
Heinrich, Prentice-Hall, 1992) reveals that while it is possible to
enable or disable traps for the 5 exception states (invalid operation,
division by zero, overflow, underflow, and inexact) on the R2000,
R3000, and R6000 processors, on the R4000, floating-point exceptions
cannot be disabled.

Attempts to disable trapping by calls to set_fpc_csr() were
unsuccessful on both SGI R3000 and R4000 systems in these tables;
in fact, the control status word returned by get_fpc_csr() was zero,
indicating no traps had been set.  Curiously, the DECstation 5000 has
a MIPS R3000 processor, and exhibits no serious impact from underflow.

The Stardent 1520 is based on the MIPS R2000 processor, but has a
separate proprietary floating-point unit which does not support
gradual underflow.

On the Sun ELC, two results are given, the first for the normal code,
and the second for a version containing a call to a library routine,
nonstandard_arithmetic(), which forces abrupt underflow instead of
gradual underflow.
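
The difference between gradual and abrupt underflow can be observed
with a small C sketch.  It does not call nonstandard_arithmetic(); it
only reports the behavior of whatever mode is currently in effect.  The
function name count_subnormal_halvings() is hypothetical, and IEEE 754
double precision with round-to-nearest is assumed.

```c
#include <float.h>

/* Halve DBL_MIN, the smallest normal double, until the value reaches
   zero, and count the nonzero (subnormal) values seen on the way. */
int count_subnormal_halvings(void)
{
    volatile double x = DBL_MIN;   /* volatile: round to double each step */
    int count = 0;

    for (;;) {
        x = x / 2.0;
        if (x == 0.0)
            break;
        count++;
    }
    return count;
}
```

With gradual underflow the function should return DBL_MANT_DIG - 1,
which is 52 for IEEE 754 double precision: one subnormal value for
each bit of the fraction.  In an abrupt-underflow mode the first
halving already produces zero, so the function returns 0.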

==============================================================================
Single-precision underflow benchmark results (times in CPU seconds)

Host                    Architecture and O/S    Time (no ufl) Time (ufl) Ratio

jeeves.math.utah.edu    DECstation 5000 Ultrix 4.2      0.17      0.18    1.0
adam.math.utah.edu      HP 9000/735 HP-UX 9.0           0.51     12.37   24.2
ee.utah.edu             HP 9000/835 HP-UX 9.0           4.37     81.98   18.7
snow.usi.utah.edu       IBM PS/2 model 70 AIX 2.1     588.8     638.2     1.1
avalanche.usi.utah.edu  IBM 3090/600S-VF AIX 2.1        1.66    295.0   177.7
osiris.usi.utah.edu     IBM RS/6000-550 AIX 3.2.3       0.53      0.68    1.3
jabberwocky.math.utah.edu NeXT Mach 3.0                 2.53     25.80   10.1
eros.math.utah.edu      SGI Indigo R3000 IRIX 4.0.5     0.93     44.70   48.1
honeycomb.math.utah.edu SGI Indigo R4000 IRIX 4.0.5     0.47     16.56   35.2
graphics.math.utah.edu  Stardent 1520 OS 2.2            4.70      4.76    1.0
magna.math.utah.edu     Sun ELC SunOS 4.1.3             1.90    479.27  252.2
magna.math.utah.edu     Sun ELC SunOS 4.1.3 (nonstd)    2.18      2.30    1.0
plot79.math.utah.edu    Sun SS10/41 Solaris 2.2         0.74      0.76    1.0
==============================================================================

------------------------------------------------------------------------

David Hough provided this feedback:

>> ...
>> ... Most RISC chips are heavily pipelined and do not handle subnormal
>> operands or underflow exceptions in hardware because that requires
>> either extra special-purpose hardware that is seldom used by most
>> programs, or extra cycles on exceptional multiplications that disrupt
>> the normal pipeline; the IEEE 754 requirements are best met by system
>> software.  CISC chips are usually not heavily pipelined and can handle
>> variable instruction timings and so can use available shifter hardware
>> for the pre- and post-normalizations.
>>
>> The SuperSPARC and MicroSPARC chips handle subnormal operands and
>> results in hardware and so pay no special penalty.  Earlier SPARC
>> chips did not, and so they offer a faster nonstandard mode in which
>> zeros are substituted for subnormals.  Future SPARC chips may revert
>> to the former practice.
>>
>> ...

Perhaps other folks on this list might report their own experience,
and offer further benchmark data for other architectures.  I can
supply the complete C and Fortran code on request.

========================================================================
Nelson H. F. Beebe                      Tel: +1 801 581 5254
Center for Scientific Computing         FAX: +1 801 581 4148
Department of Mathematics, 105 JWB      Internet: beebe@math.utah.edu
University of Utah
Salt Lake City, UT 84112, USA
========================================================================