SPEC correctness and performance
David G. Hough at validgh (dgh)
Fri Mar 15 19:51:52 PST 1996
The following was posted to comp.benchmarks and comp.arch.arithmetic today.
A B S T R A C T
Each SPEC performance benchmark program should have its own independent
correctness verification program to inspect the benchmark's output.
Assuming that correctness can be verified for each program, specbaseint98
and specbasefp98 should require compilation of all programs with the same
compilers and compilation options, and only permit options which produce
correct results for the SPEC programs and for a larger selection of test
programs. The purpose is to indicate relative performance for users
(typically mass-market ISV's) that avoid rewriting Makefiles or source codes.
Assuming that correctness can be verified for each program, specint98 and
specfp98 should permit compilation with different compilers and different
compiler options for each program, and also permit rewriting source codes
for performance, as long as all changes are disclosed. The purpose is to
indicate relative performance for users (typically scientific end users)
that are willing to rewrite Makefiles and source codes to optimize
performance of one particular program on one particular computer.
M E M O R A N D U M
This memorandum is a synopsis of points I made to the SPEC Open Systems
committee meeting, 13 March 1996 in Sunnyvale. The general topic of
discussion was accuracy issues in current and future SPEC performance
benchmark programs. My viewpoint on performance investigations is that their
purpose is to find bad news rather than good news: the weak spots of systems,
which are often veiled until customer shipments begin, rather than their
strong points, which are usually known from the earliest design stages.
I made three principal points in my prepared remarks:
1) Each benchmark program requires its own separate independent correctness
check program.
2) Programs that generate numerical exceptions in their normal course of
operation are acceptable for SPEC.
3) SPEC benchmark programs that use random floating-point numbers should be
adapted to use one common generator based upon integer arithmetic.
Differences in floating-point arithmetic can lead to quite remarkably
different outcomes in Monte Carlo physics simulations.
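To make point 3 concrete, a minimal sketch of the kind of generator I have in
mind follows. It is Park and Miller's "minimal standard" linear congruential
generator, carried entirely in 32-bit integer arithmetic via Schrage's
decomposition; the constants and the function name are my own illustration,
not anything SPEC has adopted. Because the state never depends on
floating-point rounding, every system produces the identical sequence, and a
Monte Carlo benchmark seeded from it follows the same path everywhere.

    /* Sketch of a portable random-number generator based on integer
     * arithmetic: Park-Miller minimal standard LCG, seed' = 16807 * seed
     * mod (2^31 - 1), computed with Schrage's decomposition so that no
     * intermediate result overflows a signed 32-bit integer.
     */
    #include <stdio.h>

    static long seed = 1;                    /* any value in 1 .. 2147483646 */

    double portable_random(void)             /* returns a value in (0, 1) */
    {
        const long a = 16807, m = 2147483647, q = 127773, r = 2836;
        long hi = seed / q;
        long lo = seed % q;
        seed = a * lo - r * hi;
        if (seed <= 0)
            seed += m;
        /* both operands are integers below 2^31, hence exact in IEEE double;
         * the correctly rounded quotient is identical on all IEEE 754 systems */
        return (double) seed / (double) m;
    }

    int main(void)
    {
        int i;
        for (i = 0; i < 5; i++)
            printf("%.17g\n", portable_random());
        return 0;
    }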
The ensuing discussion led me to several additional points:
4) An independent correctness test program, while no small matter to obtain,
simplifies a number of other issues that SPEC has discussed from time to
time, such as
optimizations
There is no need to argue about which optimizations are acceptable from
some abstract viewpoint; correctness tests reveal whether they are
legitimate for this particular program.
exceptions
There is no need to argue about whether exceptions are acceptable from
some abstract viewpoint; correctness tests reveal whether they are
acceptable for this particular program on this particular processor.
auto-parallelization
There is no need to argue about whether exploiting multiple processors is
acceptable from some abstract viewpoint; permit systems in which multiple
processors might be applied to one problem (such as two-processor desktop
systems, but not twenty-processor server systems, for which specrates are
more appropriate) to gain the benefits of any automatic parallelization
they can extract.
reductions
There is no need to argue about whether auto-parallelized reduction
operations, which change computed results because of differing roundoff
errors, are acceptable; correctness tests reveal whether they are
acceptable for this particular program.
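A toy illustration of the reduction point (my own example, not a SPEC code):
the same sum computed serially and in the two-way interleaved order an
auto-parallelizer would use for two processors rounds differently, so the
results agree only to within roundoff, never bit for bit. That is exactly the
judgment a per-benchmark correctness test can make and a bitwise diff cannot.

    /* The same reduction in two summation orders: a serial left-to-right
     * sum and two strided partial sums combined at the end, the shape an
     * auto-parallelizing compiler produces for two processors.  Different
     * rounding errors accumulate, so the totals differ in the last bits.
     */
    #include <stdio.h>

    #define N 1000000

    static double x[N];

    int main(void)
    {
        double serial = 0.0, part0 = 0.0, part1 = 0.0;
        int i;

        for (i = 0; i < N; i++)              /* terms of widely varying size */
            x[i] = 1.0 / (1.0 + (double) i);

        for (i = 0; i < N; i++)              /* one processor, left to right */
            serial += x[i];

        for (i = 0; i < N; i += 2)           /* "processor 0": even elements */
            part0 += x[i];
        for (i = 1; i < N; i += 2)           /* "processor 1": odd elements */
            part1 += x[i];

        printf("serial       %.17g\n", serial);
        printf("interleaved  %.17g\n", part0 + part1);
        printf("difference   %.17g\n", serial - (part0 + part1));
        return 0;
    }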
5) The maximum-optimization SPEC ratios, such as specint95 and specfp95,
currently permit very different optimizations to be applied to each
program under test, with amazing results that can be seen in SPEC
disclosures of option lists. But anybody willing to do that kind of work
on a Makefile would probably be willing to rewrite the source code too.
So I believe that these SPEC ratios should permit source code
alterations, provided those alterations are disclosed, as option lists
are now disclosed. Excessive alterations to source code would
presumably be punished in the press. Similarly valuable would be a
mechanism in the SPEC scripts that permitted specifying which source code
input files are to be processed with maximum optimizations and which with
baseline optimizations, since in many realistic applications, only a
small part of the whole program benefits from intensive optimization.
6) The baseline-optimization SPEC ratios require optimizations to be the
same for all programs under test, with some restrictions on what kinds
of optimizations may be used; those restrictions are difficult to devise
and enforce. Instead, SPEC should require that the baseline optimizations
must produce correct results with all benchmark programs, but also with
all programs in a supplemental set, some of which are specifically
designed to test correctness rather than performance. The supplemental
set I'd suggest includes SPEC 92 and 95, of course, as well as the
Perfect Club benchmarks, the LAPACK libraries and test programs, and
UCBTEST for IEEE arithmetic and elementary libm functions. (UCBTEST is
available at ftp://netlib.att.com/netlib/fp/ucbtest.tar.gz). That would
ensure that optimizations used in baseline SPEC ratios are broadly
applicable.
7) SPEC might as well concentrate on IEEE 754 platforms, and permit
benchmark programs that rely on specific features of IEEE 754 such as
nonstop exception handling and gradual underflow. Non-754 platforms,
such as some Cray supercomputers, IBM mainframes, and DEC VAX mainframes,
are not purchased on the basis of SPECint and SPECfp, and there is little
interest in those measurements on those systems.
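To make point 7 concrete, the small fragment below (my own, not from any
benchmark) runs to completion and prints finite diagnostics only because of
the two IEEE 754 features just mentioned: gradual underflow keeps a tiny
intermediate result nonzero, and nonstop exception handling turns overflow
and invalid operations into infinities and NaNs instead of aborting the run.

    /* Depends on IEEE 754 gradual underflow and nonstop exception handling. */
    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        double tiny    = DBL_MIN / 16.0;   /* subnormal, not flushed to zero */
        double big     = DBL_MAX * 2.0;    /* overflow -> +infinity, no trap  */
        double invalid = big - big;        /* infinity - infinity -> NaN      */

        printf("DBL_MIN / 16  = %g\n", tiny);
        printf("DBL_MAX * 2   = %g\n", big);
        printf("inf - inf     = %g\n", invalid);
        printf("tiny * 16     = %g\n", tiny * 16.0);  /* recovers DBL_MIN exactly */
        return 0;
    }

On a system that flushes subnormal results to zero or traps on overflow, the
same code prints different numbers or does not finish at all, which is just
the kind of difference a correctness test would have to judge.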
Correctness-Checking Programs
A benchmark program to measure performance generally produces some kind
of output, if it is modeling a realistic application; if floating-point
arithmetic is involved, the results may vary among systems, compilers, or
optimization levels. The results may be different but equally correct, or
some may be quite wrong. It may be very difficult for an expert in a
different field to determine whether the output of a computational chemistry
program is correct.
SPEC has dealt with this by comparing results using spiff, a program like
diff that allows numerical quantities to be considered equal if they are
within a tolerance. What tolerance should be used? If the SPEC members
find a tolerance that accepts all their unoptimized outputs, they are inclined
to use that as the tolerance in the SPEC scripts. Then what happens if, on
a new platform, or at a higher level of optimization, the results are somewhat
out of the agreed tolerance but not egregiously so? How can a non-expert tell
which is correct, if any?
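For reference, the numeric heart of a spiff-like comparison is no more than
the following sketch (my own paraphrase, not the actual spiff source): two
values are accepted as equal if they agree to within a relative tolerance,
with an absolute test near zero. The code is trivial; the hard question just
raised is how anyone chooses tol defensibly.

    #include <math.h>

    /* Accept 'got' as matching 'expected' if the difference is within
     * tol relative to the magnitude of the expected value, falling back
     * to an absolute comparison when that magnitude is small.
     */
    int roughly_equal(double expected, double got, double tol)
    {
        double scale = fabs(expected);
        if (scale < 1.0)
            scale = 1.0;
        return fabs(expected - got) <= tol * scale;
    }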
The answer is that a non-expert can't tell. Some numerical problems
are ill-conditioned: slight changes in the input data lead to substantial
changes in the output. Different rounding errors can often be shown to be
equivalent to slight changes in the input data; for these problems, many
substantially different outputs may be equally correct, but other
substantially different outputs may not be. A program like spiff can't deal
with that very well. Aggravating the situation, when two or more algorithms
are combined in one program, is the chance that intermediate results are ill-
conditioned with respect to the input and rounding errors, while the final
results are not.
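A two-equation example of my own shows how drastic ill-conditioning can be:
perturbing one input of a nearly singular system in its fourth decimal place
moves the solution by order one, so two outputs that look wildly different
may both be correct for inputs that differ by less than the accuracy of the
benchmark's own data.

    /* Ill-conditioning in miniature: solve the nearly singular system
     *     x +        y = 2
     *     x + 1.0001 y = b2
     * by Cramer's rule.  Changing b2 by 1e-4 moves the exact solution
     * from (1, 1) to (0, 2).
     */
    #include <stdio.h>

    static void solve(double b2)
    {
        double det = 1.0 * 1.0001 - 1.0 * 1.0;   /* about 1e-4: nearly singular */
        double x   = (2.0 * 1.0001 - 1.0 * b2) / det;
        double y   = (1.0 * b2 - 1.0 * 2.0) / det;
        printf("b2 = %.4f  ->  x = %g, y = %g\n", b2, x, y);
    }

    int main(void)
    {
        solve(2.0001);    /* exact solution (1, 1) */
        solve(2.0002);    /* exact solution (0, 2) */
        return 0;
    }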
An example of a problematic benchmark is 015.doduc in specfp92. It's a
nuclear reactor simulation, but the output from the benchmark version only
gives the number of iterations required to reach each second of simulated
time. Surely the number of iterations is not the most important output of a
reactor simulation; presumably physical quantities like temperature and
pressure are what's important. While equally correct computations might
require similar numbers of iterations, completely incorrect computations might
happen to fail in a way that doesn't affect the number of iterations. So
some anxiety is justified until an expert can certify that the number of
iterations is a reliable gauge of overall correctness.
So an automatic benchmark suite also needs an automatic correctness test
suite that checks the outputs of the benchmarks. The correctness test
program for a particular benchmark has to be written specifically for that
benchmark. No prescription is universally applicable for writing those
correctness tests; some approaches include
residuals
The linpack benchmark computes the solution x of Ax = b, and then
computes the norm of the residual b - Ax. That residual depends on
every part of the answer x, and it is fairly unlikely that the same
optimizer bug would cause x to be wrong and b - Ax to be right. "Solver"
type problems usually have some kind of residual; a sketch of such a
check appears after this list.
invariants
Many physical problems have some kind of invariant that can be checked at
the end - mass, energy, momentum.
independent algorithms
In some cases the problem that the benchmark solves could be solved in a
different way, perhaps slower.
additional input test cases
The timed input test cases should represent the bulk of the computational
load in practice. Having additional untimed input test cases that are
part of the correctness testing procedure is a way to make sure that the
timed cases have not been expedited by optimizations that fail under
other circumstances.
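As an illustration of the residual approach, here is a sketch of the idea
(my own, not the actual linpack checker): after the timed solve, an untimed
pass recomputes b - Ax and scales its norm by the sizes of A and x and by the
machine precision. A solution that is wrong almost anywhere makes this number
blow up, whatever compiler, options, or source changes produced it.

    #include <math.h>
    #include <float.h>

    /* Scaled residual ||b - Ax|| / (||A|| ||x|| n eps), using infinity norms,
     * for an n-by-n system stored row-major.  Values of order one mean the
     * solution is about as good as the arithmetic allows; large values mean
     * a wrong answer.
     */
    double scaled_residual(int n, const double *a, const double *b, const double *x)
    {
        double rnorm = 0.0, anorm = 0.0, xnorm = 0.0;
        int i, j;

        for (i = 0; i < n; i++) {
            double ri = b[i], rowsum = 0.0;
            for (j = 0; j < n; j++) {
                ri     -= a[i*n + j] * x[j];    /* residual component */
                rowsum += fabs(a[i*n + j]);     /* row sum for ||A||_inf */
            }
            if (fabs(ri)   > rnorm) rnorm = fabs(ri);
            if (rowsum     > anorm) anorm = rowsum;
            if (fabs(x[i]) > xnorm) xnorm = fabs(x[i]);
        }
        return rnorm / (anorm * xnorm * n * DBL_EPSILON);
    }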
The correctness testing program might well be slower than the benchmark
program itself. The time required for the correctness test is not counted
in the time required for the benchmark - people using computer programs don't
usually continue to check the results once they convince themselves the
program is correct - although perhaps they should.
Where do Correctness-Checking Programs come from?
In general, it is more difficult to write a good test program than to
write the program it tests. The original program was usually developed from
a fairly definite specification of what to do and how to do it, but there is
no "how to" in the specification of the test program. There are several
general approaches as outlined above, but no guarantee that any is appropriate
in a particular case, nor any procedure for deciding which to apply to a
benchmark program in an area of science that's not familiar.
So it's really unrealistic to expect the SPEC volunteers from its member
companies in the computer business to be able to produce such test programs;
they are fully occupied in the porting process for the benchmarks, and getting
them to run at all on a variety of platforms is no small accomplishment. It
seems to me that this is a fruitful area for academic involvement - MS theses
in areas like computational chemistry might well be based on writing test
programs for SPEC benchmarks in this area, or rewriting SPEC benchmarks so
that the output can be meaningfully tested. Students and professors with
such interests should contact SPEC.
For more information
The tbl | troff -ms source for this memorandum, for the performance
reports which led me to its conclusions, and for earlier commentaries on test
programs, is available from dgh@validgh.com.