request for correctness/performance test programs
David G. Hough at validgh
dgh
Mon Mar 21 20:28:51 PST 1994
In my correctness and performance studies I have a continuing need for
suitable test programs. I'm writing to solicit contributions. Contributed
programs and a test framework will be placed in the public domain when ready.
As with "free" software, the money is in the support, interpretation and
evaluation of results... I hope.
If you do have something to suggest, please tell me about it briefly
before sending code. Please don't publish this request widely at the present
time. If I run out of suggestions, I'll ask elsewhere. comp.benchmarks, for
instance, is full of people with pointless "benchmark" programs that I'd
rather not try to tactfully decline.
This memorandum turned out much longer than I expected. I can send tbl |
troff -ms source to anybody who'd rather print it out.
Affirmative Action
Even though I'm soliciting donations, I have a long list of restrictions
on what I can use effectively, which I will list later. But to put things in
a more positive light, here is a list of some affirmative action properties
that I particularly solicit, roughly in order of importance:
IEEE 754 Exceptions
Programs that generate IEEE 754 exceptions are fine. SPEC won't take
them.
IEEE 754 Implicit Dependence
Programs that implicitly depend on the default properties of IEEE 754
arithmetic are fine. I don't anticipate trying to port to VAX, IBM 370,
or Cray. SPEC still needs to be able to run everything on a VAX to get
reference times.
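A minimal sketch of what implicit dependence can look like, assuming a C compiler that provides the INFINITY and NAN macros in <math.h>. The function names are made up for illustration. Neither routine tests for special cases; both rely on the default nonstop behavior of IEEE 754 arithmetic.

```c
#include <assert.h>
#include <math.h>

/* Resistance of r1 and r2 in parallel: 1/(1/r1 + 1/r2).
 * With IEEE 754 defaults, r1 == 0 gives 1/0 = +inf, inf + 1/r2 = inf,
 * and 1/inf = 0, which is the right answer with no special-case test. */
double parallel_resistance(double r1, double r2)
{
    return 1.0 / (1.0 / r1 + 1.0 / r2);
}

/* Largest element of x[0..n-1], skipping NaN entries.
 * Relies on two IEEE 754 properties: -inf compares below every other
 * value, and any ordered comparison involving NaN is false.
 * Returns -inf if every entry is NaN. */
double max_ignoring_nan(const double *x, int n)
{
    double best = -INFINITY;
    int i;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        if (x[i] > best)        /* false whenever x[i] is NaN */
            best = x[i];
    }
    return best;
}
```

On a non-IEEE machine the division by zero in parallel_resistance would typically trap or produce garbage, which is why a program written this way is hard to port to VAX, IBM 370, or Cray arithmetic.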
IEEE 754 Explicit Dependence
Programs that explicitly depend on IEEE 754 features such as remainder,
convert to integer in floating format, the appendix functions, rounding
modes, exception flags, and traps may be OK. The problem is that the
coding supporting use of those features is machine-dependent, often in
assembler. I may be able to hack it if the machine-dependent parts are
very small, well-encapsulated, and critical to the correctness or perfor-
mance of the program. SPEC won't take programs with such explicit depen-
dencies. To indicate the scope of my problem, initially I will try to
get programs working on SunOS 3, 4, 5 including x86, AUX, and Linux, and
I can imagine that AIX, HPUX, IRIX, (Alpha) OSF, and SunOS 5 for PowerPC
may enter the picture later.
Problem Programs
Programs that have turned up correctness failures or performance
anomalies in workstation hardware or software are especially useful.
Failures on Suns are especially welcome - the test programs I've been
using up until now have had as many of the problems beaten out of them as
possible at Sun. But please be fairly certain the programs are legal
according to the relevant language standard. SPEC won't take anything
that won't work on most of its members' platforms.
Moderately Large Data Sets
Programs with moderately large data sets for input or output, up to
perhaps 100MB total, are fine.
Moderately Large Working Sets
Programs with moderately large working sets, up to about 32MB, are fine.
Not CPU-Bound
Programs that are important in some scientific/engineering application
may be of interest, even if they are not bound by floating-point,
integer, or memory bandwidth.
Unix Dependence
Programs that depend in subtle ways on the fact that they're running on
some kind of Unix derivative are fine. I don't expect to port to main-
frame, supercomputer, or PC operating systems that aren't a lot like Unix
because my test structure won't handle them. Programs that are depen-
dent on a very specific Unix derivative aren't so fine.
X Graphics Programs
Programs whose output is in the form of X graphics are useful if the
graphic rendering portion significantly affects performance. More typical
scientific programs with graphical output that mostly compute and then
quickly display the computed result graphically are not worth the trouble
of getting them to compile and link with X on various platforms.
SPEC Rejects
SPEC has necessarily rejected a number of candidate programs for various
reasons - lack of manpower to investigate all of them being an important
one. SPEC rejects are fine for me provided they weren't rejected for one
of the reasons I'd reject them.
Mixed C and Fortran
Mixed language programs are a pain - darn those underscores - but in many
cases it makes a lot of sense to code the I/O in C and the number crunch-
ing in Fortran.
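For readers who haven't met the underscore problem, here is a sketch of the C side, assuming the convention used by f2c and a number of Unix Fortran compilers: the external name is lower case with one underscore appended, and every argument is passed by reference. Other compilers use other conventions, which is the pain. The macro F77_NAME and the routine dscale are hypothetical names for illustration.

```c
#include <assert.h>

/* Hide the name-mangling convention in one place.  This definition
 * assumes "append one underscore"; a compiler that appends nothing
 * would need  #define F77_NAME(lower) lower  instead. */
#define F77_NAME(lower) lower##_

/* Callable from Fortran as   CALL DSCALE(N, ALPHA, X)
 * with INTEGER N and DOUBLE PRECISION ALPHA, X(N).
 * Fortran passes arguments by reference, so all are pointers here. */
void F77_NAME(dscale)(const int *n, const double *alpha, double *x)
{
    int i;

    assert(*n >= 0);
    for (i = 0; i < *n; i++)
        x[i] *= *alpha;
}
```

From C the same routine is reached under its mangled name, for example dscale_(&n, &alpha, x). Character arguments and functions returning complex values are handled differently by different compilers and are not covered by this sketch.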
Do All Directives
The least common denominator explicit parallelization directive - "do the
following loop in parallel unconditionally" - seems worth admitting even
though the spelling varies among (mostly Fortran) compilers.
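As an illustration only, here is one spelling of such a directive in C. It uses the OpenMP "parallel for" pragma, which is an assumption on my part: OpenMP postdates this memorandum and is not one of the vendor spellings referred to above. A compiler that doesn't recognize the pragma ignores it and runs the loop serially with the same result.

```c
#include <assert.h>

/* y[i] = y[i] + a * x[i] for i = 0..n-1.
 * Each iteration touches only its own x[i] and y[i], so the loop can be
 * run in parallel unconditionally, provided x and y do not overlap. */
void axpy_doall(int n, double a, const double *x, double *y)
{
    int i;

    assert(n >= 0);
#pragma omp parallel for
    for (i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

The directive is a promise by the programmer, not something the compiler verifies; a loop with a dependence between iterations would give wrong answers when marked this way.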
Restrictions
Here, roughly in order of importance, are the restrictions on what I can
use effectively. I realize that some of them tend to be contradictory. Beg-
gars can't be choosy, you say? Perhaps, but for the moment I will hope for
the best.
Freely Distributable
It doesn't do any good to publish correctness or performance evaluations
for which other people can't check your work to determine whether you
know what you're talking about. So I don't want to adopt any more pro-
grams that aren't freely distributable internationally.
Freely Modifiable
I invariably have to modify programs somewhat to fit my test structure,
and typically to standardize the time measurement and the uniform random
number generation, if any.
Realistic Applications of Economic Significance
Kernels are unacceptable, and simplified applications are suspect. The
reason is that people suppose or determine what the bottleneck on a par-
ticular application on a particular system is at a particular time, pro-
duce a drastically simplified benchmark program, and continue to run it
for years. Meanwhile technology changes, and the bottleneck in a par-
ticular application changes, and they wonder why the delivered system
performance on the actual application doesn't match the expectations they
developed benchmarking during the procurement.
On the other hand, realistic applications are often very complex and
messy, and simplifications in that respect are welcome as long as they
don't remove any potential future bottlenecks. "Economic Significance"
means that somebody would save some significant resources if the test
program ran faster - we would like hardware and software designers to
optimize the most economically valuable things.
One way to think of economic significance is this: for a program that
runs enough each year to keep ten workstations busy full time, a 10%
improvement in optimization would free up one workstation-year, which
seems worthwhile ($10000) to me. Correspondingly, for a program that
keeps one supercomputer busy 10% of the time, a 1% optimization would
free up 0.1% of a supercomputer-year, which might also be about $10000.
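The arithmetic above can be written out as a small sketch. The function names are hypothetical. The $10000 per workstation-year is the figure quoted above; the $10,000,000 per supercomputer-year used in the usage note is only what that figure implies (0.1% of it is $10000), not a quoted price.

```c
#include <assert.h>

/* Machine-years freed per year when a program that keeps
 * `machines_busy` machines fully occupied is made to run faster by
 * `improvement` (0.10 means 10%). */
double machine_years_freed(double machines_busy, double improvement)
{
    assert(improvement >= 0.0 && improvement <= 1.0);
    return machines_busy * improvement;
}

/* The same saving expressed in dollars per year. */
double dollars_saved(double machines_busy, double improvement,
                     double cost_per_machine_year)
{
    return machine_years_freed(machines_busy, improvement)
           * cost_per_machine_year;
}
```

dollars_saved(10.0, 0.10, 10000.0) covers the ten-workstation case, and dollars_saved(0.10, 0.01, 1.0e7) covers the supercomputer that is busy 10% of the time; both come to about $10000.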
Portable Correct Fortran or C
Aside from exploiting IEEE 754 features as outlined above, programs
should be portable among modern PCs and workstations. The principal
correctness issue is avoiding violations of language restrictions that
optimizers depend on and that can't be checked economically at compile
time or run time. The worst of these is aliasing parameters to Fortran
subprograms. Such violations are the devil to track down, and look just
like optimizer
bugs, because they usually work as expected unoptimized, and typically
worked as expected optimized on all the releases previous to the one that
broke.
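A sketch in C of why aliased parameters look like optimizer bugs. Both functions are well-defined C; the second is written by hand to mimic what an optimizer is entitled to do when the language promises, as Fortran does for a dummy argument that is modified, that the two arguments never overlap. The function names are hypothetical.

```c
#include <assert.h>

/* As the programmer wrote it: *b is fetched again for each use. */
void add_twice_reload(double *a, const double *b)
{
    assert(a && b);
    *a += *b;
    *a += *b;
}

/* What an optimizer may produce under a no-overlap guarantee:
 * *b is fetched once and kept in a register. */
void add_twice_cached(double *a, const double *b)
{
    double t;

    assert(a && b);
    t = *b;
    *a += t;
    *a += t;
}
```

Called with distinct arguments the two versions agree. Called with the same variable for both arguments, starting from 1.0, the first leaves 4.0 and the second leaves 3.0. In Fortran the aliased call is illegal, so the compiler that produces 3.0 is correct and the program is wrong, even though it "worked" unoptimized.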
Self-Contained
Programs that can completely link with the standard libraries are best.
But I am used to dealing with timing and uniform random number genera-
tors.
Self-Checking
To combat incorrect optimization, it's essential to be able to detect
whether the output is correct or not. The best kind of self-check is a
mathematical residual, like ||b-Ax|| that the Linpack benchmarks compute,
and another good kind for physical problems is that an invariant like
energy or momentum is conserved. The PERFECT programs test some of
their computed results against various error bounds which presumably were
derived from runs believed to be correct, but the different programs
appear to vary in the diligence with which those bounds were derived.
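A minimal sketch of a residual self-check, assuming a dense n-by-n matrix stored by rows and an acceptance threshold supplied by the caller. The function names and the threshold parameter are hypothetical; this is not the Linpack benchmark's own code, which scales the residual before judging it.

```c
#include <assert.h>

/* Infinity norm of the residual b - A*x, with A stored by rows.
 * A NaN anywhere in the residual is returned as NaN so that it cannot
 * be mistaken for a small residual. */
double residual_inf_norm(int n, const double *a, const double *x,
                         const double *b)
{
    double worst = 0.0;
    int i, j;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        double r = b[i];
        for (j = 0; j < n; j++)
            r -= a[i * n + j] * x[j];
        if (r != r)             /* true only for NaN */
            return r;
        if (r < 0.0)
            r = -r;
        if (r > worst)
            worst = r;
    }
    return worst;
}

/* Returns 1 if the computed x passes the self-check, 0 otherwise.
 * The comparison is false for a NaN residual, so NaN fails. */
int solution_passes(int n, const double *a, const double *x,
                    const double *b, double threshold)
{
    return residual_inf_norm(n, a, x, b) <= threshold;
}
```

The attraction of a residual is that it is computed from the program's own input and output, so it needs no stored reference results and catches a miscompiled solver on any platform.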
Stable Against Roundoff
If no self-check is obvious, at least the computed results should be
fairly stable against differences in expression evaluation and elementary
transcendental functions, so that correctness of optimized output, for
instance, can be judged by looking at unoptimized output.
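One way to judge optimized output against unoptimized output is a tolerance comparison, sketched below. The function names and the two tolerance parameters are hypothetical; suitable tolerance values depend entirely on the program being tested.

```c
#include <assert.h>

/* Returns 1 if test agrees with ref to within abs_tol, or to within
 * rel_tol relative to the magnitude of ref; otherwise 0.
 * Identical values (including equal infinities) always agree; a NaN on
 * either side never does. */
int close_enough(double ref, double test, double rel_tol, double abs_tol)
{
    double diff, scale;

    if (ref == test)
        return 1;
    diff = ref - test;
    if (diff < 0.0)
        diff = -diff;
    scale = ref < 0.0 ? -ref : ref;
    return diff <= abs_tol || diff <= rel_tol * scale;
}

/* Number of positions at which test[] disagrees with ref[]. */
int count_mismatches(const double *ref, const double *test, int n,
                     double rel_tol, double abs_tol)
{
    int i, bad = 0;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        if (!close_enough(ref[i], test[i], rel_tol, abs_tol))
            bad++;
    return bad;
}
```

A program is stable in the sense meant here if a modest tolerance gives zero mismatches between builds at different optimization levels; if no reasonable tolerance does, the comparison can't separate roundoff from a miscompilation.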
Run for 1-10 Minutes on Current Workstations
Programs that run quickly may not be timed very accurately on a multi-
user operating system. Programs that run a long time are acceptable,
but programs that run a moderate amount of time and have the same perfor-
mance characteristics are even more acceptable.
Batch, not Interactive
My test harness runs for days. Input has to be from fixed data files.
Orthogonal to other Programs I use
Programs whose performance correlates closely, over a variety of
platforms, with a program in SPEC or PERFECT or with one of the other
programs I use contribute little useful information.
Multiple Input Data Sets
Programs with no input data are suspect for overzealous optimization.
Complicated programs like SPICE should have a variety of input data sets
that exercise different features.
Main Input and Output in ASCII
Programs may have other input, scratch, and output files, but the input
file specifying the computation parameters and the output file containing
the information about whether the results were correct should be in
ASCII.
Current Algorithmic Technology
Programs that solve linear least squares problems by explicitly forming
the normal equations and then explicitly forming the matrix inverse do
not represent current technology. I'd prefer something more competent,
but this has to be balanced against the economic significance reality
that people pay for making their existing programs run faster with as
little change as possible.
One per Person
Since I have limited bandwidth for processing contributions, I'd appreci-
ate it if interested persons nominate their one best candidate; I'll ask
for more when I finish with those.
Programs currently used
In my "Searching for a Solaris Workstation" report, I used the following
programs to test correctness and performance:
spiff
f2c
SPECint92
SPECfp92
PERFECT
spice3e2
SPEC, PERFECT, and SPICE are not freely distributable, but they are more or
less readily obtainable. spiff and f2c are freely distributable.
Current candidates for addition to the suite, pending confirmation of
free distributability:
barroln
Fortran three-dimensional steady state deformation in bar rolling, by K.
Mori of Kyoto Institute of Technology.
cslalom
C translation of optical radiosity benchmark by Gustafson et al, Ames.
goliath
Fortran Exact Rational System Analyzer by Alfeld and Eyre of Utah.
herwig
Fortran Monte Carlo simulation of Hadron/Gluon interactions, from Parma
and Cambridge.
hydro
Fortran two-dimensional Lagrangian hydrodynamics, from Los Alamos.
isajet
Fortran Monte Carlo simulation of particle interactions, by Paige and
Protopopescu, Brookhaven.
jetset
Fortran Monte Carlo jet fragmentation, by Sjostrand and Bengtsson from
Lund.
rayshade 406
C ray tracing program, by Kolb, Bogart et al of Princeton.
reweight
Fortran Monte Carlo simulation of particles through a detector, by Sey-
mour of Washington University.
sim
C comparison of two sequences of DNA, by Huang and Miller of Penn State.
tc8
Fortran mathematical group theory, from University of Sydney, NSW.
wondy
Fortran solid dynamics simulation, from Sandia.
1000x1000 linpack
Fortran based on LAPACK and [CDSZ]GEMM BLAS.
thesis
If I had a way to extract the source code for my thesis work from a
CDC-6400 tape made in 1977, I might make a test program out of that -
Fortran searching for minima of a complicated non-analytic complex func-
tion, corresponding to the nearest polynomial with a double zero.
Several of these candidates deviate in some respect from the ideals
listed above, and so might be superseded. Thus the 1000x1000 linpack is an
exception to the rule against kernels, and subject to deletion if a suitable
substitute comes along, and optimizing my thesis research programs is of ques-
tionable economic utility.