SPEC correctness and performance
David G. Hough at validgh (dgh)
Fri Mar 15 19:51:52 PST 1996
The following was posted to comp.benchmarks and comp.arch.arithmetic today.
A B S T R A C T
Each SPEC performance benchmark program should have its own independent
correctness verification program to inspect the benchmark's output.
Assuming that correctness can be verified for each program, specbaseint98
and specbasefp98 should require compilation of all programs with the same
compilers and compilation options, and only permit options which produce
correct results for the SPEC programs and for a larger selection of test
programs. The purpose is to indicate relative performance for users
(typically mass-market ISV's) that avoid rewriting Makefiles or source codes.
Assuming that correctness can be verified for each program, specint98 and
specfp98 should permit compilation with different compilers and different
compiler options for each program, and also permit rewriting source codes
for performance, as long as all changes are disclosed. The purpose is to
indicate relative performance for users (typically scientific end users)
that are willing to rewrite Makefiles and source codes to optimize
performance of one particular program on one particular computer.
M E M O R A N D U M
This memorandum is a synopsis of points I made to the SPEC Open Systems
committee meeting, 13 March 1996 in Sunnyvale. The general topic of
discussion was accuracy issues in current and future SPEC performance
benchmark programs. My viewpoint on performance investigations is that their
purpose is to find bad news rather than good news: the weak spots of systems,
which are often veiled until customer shipments begin, rather than their
strong points, which are usually known from the earliest design stages.
I made three principal points in my prepared remarks:
1) Each benchmark program requires its own separate independent correctness
check program.
2) Programs that generate numerical exceptions in their normal course of
operation are acceptable for SPEC.
3) SPEC benchmark programs that use random floating-point numbers should be
adapted to use one common generator based upon integer arithmetic.
Differences in floating-point arithmetic can lead to quite remarkably
different outcomes in Monte Carlo physics simulations.
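To make point 3 concrete, a minimal sketch of the kind of generator I have in
mind follows. It is Park and Miller's "minimal standard" linear congruential
generator, carried entirely in 32-bit integer arithmetic via Schrage's
decomposition; the constants and the function name are my own illustration,
not anything SPEC has adopted. Because the state never depends on
floating-point rounding, every system produces the identical sequence, and a
Monte Carlo benchmark seeded from it follows the same path everywhere.

    /* Sketch of a portable random-number generator based on integer
     * arithmetic: Park-Miller minimal standard LCG, seed' = 16807 * seed
     * mod (2^31 - 1), computed with Schrage's decomposition so that no
     * intermediate result overflows a signed 32-bit integer.
     */
    #include <stdio.h>

    static long seed = 1;                    /* any value in 1 .. 2147483646 */

    double portable_random(void)             /* returns a value in (0, 1) */
    {
        const long a = 16807, m = 2147483647, q = 127773, r = 2836;
        long hi = seed / q;
        long lo = seed % q;
        seed = a * lo - r * hi;
        if (seed <= 0)
            seed += m;
        /* both operands are integers below 2^31, hence exact in IEEE double;
         * the correctly rounded quotient is identical on all IEEE 754 systems */
        return (double) seed / (double) m;
    }

    int main(void)
    {
        int i;
        for (i = 0; i < 5; i++)
            printf("%.17g\n", portable_random());
        return 0;
    }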
The ensuing discussion led me to several additional points:
4) An independent correctness test program, while no small matter to obtain,
simplifies a number of other issues that SPEC has discussed from time to
time, such as
optimizations
There is no need to argue about which optimizations are acceptable from
some abstract viewpoint; correctness tests reveal whether they are
legitimate for this particular program.
exceptions
There is no need to argue about whether exceptions are acceptable from
some abstract viewpoint; correctness tests reveal whether they are
acceptable for this particular program on this particular processor.
auto-parallelization
There is no need to argue about whether exploiting multiple processors is
acceptable from some abstract viewpoint; permit systems in which multiple
processors might be applied to one problem (such as two-processor desktop
systems, but not twenty-processor server systems, for which specrates are
more appropriate) to gain the benefits of any automatic parallelization
they can extract.
reductions
There is no need to argue about whether auto-parallelized reduction
operations, which change computed results because of differing roundoff
errors, are acceptable; correctness tests reveal whether they are
acceptable for this particular program.
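A toy illustration of the reduction point (my own example, not a SPEC code):
the same sum computed serially and in the two-way interleaved order an
auto-parallelizer would use for two processors rounds differently, so the
results agree only to within roundoff, never bit for bit. That is exactly the
judgment a per-benchmark correctness test can make and a bitwise diff cannot.

    /* The same reduction in two summation orders: a serial left-to-right
     * sum and two strided partial sums combined at the end, the shape an
     * auto-parallelizing compiler produces for two processors.  Different
     * rounding errors accumulate, so the totals differ in the last bits.
     */
    #include <stdio.h>

    #define N 1000000

    static double x[N];

    int main(void)
    {
        double serial = 0.0, part0 = 0.0, part1 = 0.0;
        int i;

        for (i = 0; i < N; i++)              /* terms of widely varying size */
            x[i] = 1.0 / (1.0 + (double) i);

        for (i = 0; i < N; i++)              /* one processor, left to right */
            serial += x[i];

        for (i = 0; i < N; i += 2)           /* "processor 0": even elements */
            part0 += x[i];
        for (i = 1; i < N; i += 2)           /* "processor 1": odd elements */
            part1 += x[i];

        printf("serial       %.17g\n", serial);
        printf("interleaved  %.17g\n", part0 + part1);
        printf("difference   %.17g\n", serial - (part0 + part1));
        return 0;
    }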
5) The maximum-optimization SPEC ratios, such as specint95 and specfp95,
currently permit very different optimizations to be applied to each
program under test, with amazing results that can be seen in SPEC
disclosures of option lists. But anybody willing to do that kind of work
on a Makefile would probably be willing to rewrite the source code too.
So I believe that these SPEC ratios should permit source code
alterations, provided those alterations are disclosed, as option lists
are now disclosed. Excessive alterations to source code would
presumably be punished in the press. Similarly valuable would be a
mechanism in the SPEC scripts that permitted specifying which source code
input files are to be processed with maximum optimizations and which with
baseline optimizations, since in many realistic applications, only a
small part of the whole program benefits from intensive optimization.
6) The baseline-optimization SPEC ratios require optimizations to be the
same for all programs under test, with some restrictions on what kinds
of optimizations may be used; those restrictions are difficult to devise
and enforce. Instead, SPEC should require that the baseline optimizations
must produce correct results with all benchmark programs, but also with
all programs in a supplemental set, some of which are specifically
designed to test correctness rather than performance. The supplemental
set I'd suggest includes SPEC 92 and 95, of course, as well as the
Perfect Club benchmarks, the LAPACK libraries and test programs, and
UCBTEST for IEEE arithmetic and elementary libm functions. (UCBTEST is
available at ftp://netlib.att.com/netlib/fp/ucbtest.tar.gz). That would
ensure that optimizations used in baseline SPEC ratios are broadly
applicable.
7) SPEC might as well concentrate on IEEE 754 platforms, and permit
benchmark programs that rely on specific features of IEEE 754 such as
nonstop exception handling and gradual underflow. Non-754 platforms,
such as some Cray supercomputers, IBM mainframes, and DEC VAX mainframes,
are not purchased on the basis of SPECint and SPECfp, and there is little
interest in those measurements on those systems.
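To make point 7 concrete, the small fragment below (my own, not from any
benchmark) runs to completion and prints finite diagnostics only because of
the two IEEE 754 features just mentioned: gradual underflow keeps a tiny
intermediate result nonzero, and nonstop exception handling turns overflow
and invalid operations into infinities and NaNs instead of aborting the run.

    /* Depends on IEEE 754 gradual underflow and nonstop exception handling. */
    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        double tiny    = DBL_MIN / 16.0;   /* subnormal, not flushed to zero */
        double big     = DBL_MAX * 2.0;    /* overflow -> +infinity, no trap  */
        double invalid = big - big;        /* infinity - infinity -> NaN      */

        printf("DBL_MIN / 16  = %g\n", tiny);
        printf("DBL_MAX * 2   = %g\n", big);
        printf("inf - inf     = %g\n", invalid);
        printf("tiny * 16     = %g\n", tiny * 16.0);  /* recovers DBL_MIN exactly */
        return 0;
    }

On a system that flushes subnormal results to zero or traps on overflow, the
same code prints different numbers or does not finish at all, which is just
the kind of difference a correctness test would have to judge.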
Correctness-Checking Programs
A benchmark program to measure performance generally produces some kind
of output, if it is modeling a realistic application; if floating-point
arithmetic is involved, the results may vary among systems, compilers, or
optimization levels. The results may be different but equally correct, or
some may be quite wrong. It may be very difficult for an expert in a
different field to determine whether the output of a computational chemistry
program is correct.
SPEC has dealt with this by comparing results using spiff, a program like
diff that allows numerical quantities to be considered equal if they are
within a tolerance. What tolerance should be used? If the SPEC members
find a tolerance that accepts all their unoptimized outputs, they are inclined
to use that as the tolerance in the SPEC scripts. Then what happens if, on
a new platform, or at a higher level of optimization, the results are somewhat
out of the agreed tolerance but not egregiously so? How can a non-expert tell
which is correct, if any?
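For reference, the numeric heart of a spiff-like comparison is no more than
the following sketch (my own paraphrase, not the actual spiff source): two
values are accepted as equal if they agree to within a relative tolerance,
with an absolute test near zero. The code is trivial; the hard question just
raised is how anyone chooses tol defensibly.

    #include <math.h>

    /* Accept 'got' as matching 'expected' if the difference is within
     * tol relative to the magnitude of the expected value, falling back
     * to an absolute comparison when that magnitude is small.
     */
    int roughly_equal(double expected, double got, double tol)
    {
        double scale = fabs(expected);
        if (scale < 1.0)
            scale = 1.0;
        return fabs(expected - got) <= tol * scale;
    }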
The answer is that a non-expert can't tell. Some numerical problems
are ill-conditioned: slight changes in the input data lead to substantial
changes in the output. Different rounding errors can often be shown to be
equivalent to slight changes in the input data; for these problems, many
substantially different outputs may be equally correct, but other
substantially different outputs may not be. A program like spiff can't deal
with that very well. Aggravating the situation, when two or more algorithms
are combined in one program, is the chance that intermediate results are ill-
conditioned with respect to the input and rounding errors, while the final
results are not.
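A two-equation example of my own shows how drastic ill-conditioning can be:
perturbing one input of a nearly singular system in its fourth decimal place
moves the solution by order one, so two outputs that look wildly different
may both be correct for inputs that differ by less than the accuracy of the
benchmark's own data.

    /* Ill-conditioning in miniature: solve the nearly singular system
     *     x +        y = 2
     *     x + 1.0001 y = b2
     * by Cramer's rule.  Changing b2 by 1e-4 moves the exact solution
     * from (1, 1) to (0, 2).
     */
    #include <stdio.h>

    static void solve(double b2)
    {
        double det = 1.0 * 1.0001 - 1.0 * 1.0;   /* about 1e-4: nearly singular */
        double x   = (2.0 * 1.0001 - 1.0 * b2) / det;
        double y   = (1.0 * b2 - 1.0 * 2.0) / det;
        printf("b2 = %.4f  ->  x = %g, y = %g\n", b2, x, y);
    }

    int main(void)
    {
        solve(2.0001);    /* exact solution (1, 1) */
        solve(2.0002);    /* exact solution (0, 2) */
        return 0;
    }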
An example of a problematic benchmark is 015.doduc in specfp92. It's a
nuclear reactor simulation, but the output from the benchmark version only
gives the number of iterations required to reach each second of simulated
time. Surely the number of iterations is not the most important output of a
reactor simulation; presumably physical quantities like temperature and
pressure are what's important. While equally correct computations might
require similar numbers of iterations, completely incorrect computations might
happen to fail in a way that doesn't affect the number of iterations. So
some anxiety is justified until an expert can certify that the number of
iterations is a reliable gauge of overall correctness.
So an automatic benchmark suite also needs an automatic correctness test
suite that checks the outputs of the benchmarks. The correctness test
program for a particular benchmark has to be written specifically for that
benchmark. No prescription is universally applicable for writing those
correctness tests; some approaches include
residuals
The linpack benchmark computes the solution x of Ax = b, and then
computes the norm of the residual b - Ax. That residual depends on
every part of the answer x, and it is fairly unlikely that the same
optimizer bug would cause x to be wrong and b - Ax to be right. "Solver"
type problems usually have some kind of residual; a sketch of such a
check appears after this list.
invariants
Many physical problems have some kind of invariant that can be checked at
the end - mass, energy, momentum.
independent algorithms
In some cases the problem that the benchmark solves could be solved in a
different way, perhaps slower.
additional input test cases
The timed input test cases should represent the bulk of the computational
load in practice. Having additional untimed input test cases that are
part of the correctness testing procedure is a way to make sure that the
timed cases have not been expedited by optimizations that fail under
other circumstances.
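As an illustration of the residual approach, here is a sketch of the idea
(my own, not the actual linpack checker): after the timed solve, an untimed
pass recomputes b - Ax and scales its norm by the sizes of A and x and by the
machine precision. A solution that is wrong almost anywhere makes this number
blow up, whatever compiler, options, or source changes produced it.

    #include <math.h>
    #include <float.h>

    /* Scaled residual ||b - Ax|| / (||A|| ||x|| n eps), using infinity norms,
     * for an n-by-n system stored row-major.  Values of order one mean the
     * solution is about as good as the arithmetic allows; large values mean
     * a wrong answer.
     */
    double scaled_residual(int n, const double *a, const double *b, const double *x)
    {
        double rnorm = 0.0, anorm = 0.0, xnorm = 0.0;
        int i, j;

        for (i = 0; i < n; i++) {
            double ri = b[i], rowsum = 0.0;
            for (j = 0; j < n; j++) {
                ri     -= a[i*n + j] * x[j];    /* residual component */
                rowsum += fabs(a[i*n + j]);     /* row sum for ||A||_inf */
            }
            if (fabs(ri)   > rnorm) rnorm = fabs(ri);
            if (rowsum     > anorm) anorm = rowsum;
            if (fabs(x[i]) > xnorm) xnorm = fabs(x[i]);
        }
        return rnorm / (anorm * xnorm * n * DBL_EPSILON);
    }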
The correctness testing program might well be slower than the benchmark
program itself. The time required for the correctness test is not counted
in the time required for the benchmark - people using computer programs don't
usually continue to check the results once they convince themselves the
program is correct - although perhaps they should.
Where do Correctness-Checking Programs come from?
In general, it is more difficult to write a good test program than to
write the program it tests. The original program was usually developed from
a fairly definite specification of what to do and how to do it, but there is
no "how to" in the specification of the test program. There are several
general approaches as outlined above, but no guarantee that any is appropriate
in a particular case, nor any procedure for deciding which to apply to a
benchmark program in an area of science that's not familiar.
So it's really unrealistic to expect the SPEC volunteers from its member
companies in the computer business to be able to produce such test programs;
they are fully occupied in the porting process for the benchmarks, and getting
them to run at all on a variety of platforms is no small accomplishment. It
seems to me that this is a fruitful area for academic involvement - MS theses
in areas like computational chemistry might well be based on writing test
programs for SPEC benchmarks in this area, or rewriting SPEC benchmarks so
that the output can be meaningfully tested. Students and professors with
such interests should contact SPEC.
For more information
The tbl | troff -ms source for this memorandum, for the performance
reports which led me to its conclusions, and for earlier commentaries on test
programs, is available from dgh@validgh.com.