request for correctness/performance test programs

Mon Mar 21 20:28:51 PST 1994

     In my correctness and performance studies I have a continuing need for
suitable test programs.    I'm writing to solicit contributions.  Contributed
programs and a test framework will be placed in the public domain when ready.
As with "free" software, the money is in the support, interpretation and
evaluation of results... I hope.

     If you do have something to suggest, please tell me about it briefly
before sending code.  Please don't publish this request widely at the present
time.  If I run out of suggestions, I'll ask elsewhere.  comp.benchmarks, for
instance, is full of people with pointless "benchmark" programs that I'd
rather not try to tactfully decline.

     This memorandum turned out much longer than I expected.  I can send tbl |
troff -ms source to anybody who'd rather print it out.

Affirmative Action

     Even though I'm soliciting donations, I have a long list of restrictions
on what I can use effectively, which I will list later.   But to put things in
a more positive light, here is a list of some affirmative action properties
that I particularly solicit, roughly in order of importance:

IEEE 754 Exceptions
     Programs that generate IEEE 754 exceptions are fine.  SPEC won't take
     them.

IEEE 754 Implicit Dependence
     Programs that implicitly depend on the default properties of IEEE 754
     arithmetic are fine.   I don't anticipate trying to port to VAX, IBM 370,
     or Cray.  SPEC still needs to be able to run everything on a VAX to get
     reference times.
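
     As a minimal sketch of what implicit dependence can look like (an
     illustration only, not taken from any contributed program), the C
     function below, with the hypothetical name parallel_resistance, relies
     on the IEEE 754 defaults that a nonzero value divided by zero yields
     infinity and that 1/infinity yields zero.  On arithmetic that traps or
     aborts on division by zero it would need an explicit special case.

```c
/* Hypothetical example: resistance of two resistors in parallel.
 * Assumes IEEE 754 default (non-trapping) arithmetic, where a finite
 * nonzero value divided by zero is infinity and 1/infinity is zero. */
double parallel_resistance(double r1, double r2)
{
    /* If r1 == 0.0, then 1.0 / r1 is +infinity, the sum is +infinity,
     * and the final quotient is 0.0: the short circuit falls out of
     * the arithmetic without any explicit test. */
    return 1.0 / (1.0 / r1 + 1.0 / r2);
}
```

     Called as parallel_resistance(0.0, 5.0), the sketch returns zero on
     IEEE 754 hardware with default exception handling.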

IEEE 754 Explicit Dependence
     Programs that explicitly depend on IEEE 754 features such as remainder,
     convert to integer in floating format, the appendix functions, rounding
     modes, exception flags, and traps may be OK.  The problem is that the
     coding supporting use of those features is machine-dependent, often in
     assembler.  I may be able to hack it if the machine-dependent parts are
     very small, well-encapsulated, and critical to the correctness or
     performance of the program.  SPEC won't take programs with such explicit
     dependencies.  To indicate the scope of my problem, initially I will try
     to get programs working on SunOS 3, 4, 5 including x86, AUX, and Linux,
     and I can imagine that AIX, HPUX, IRIX, (Alpha) OSF, and SunOS 5 for
     PowerPC may enter the picture later.

Problem Programs
     Programs that have turned up correctness failures or performance
     anomalies in workstation hardware or software are especially useful,
     particularly on Suns: the test programs I've been using up until now
     have had as many of the problems beaten out of them as possible at Sun.
     But please be fairly certain the programs are legal according to the
     relevant language standard.  SPEC won't take anything that won't work on
     most of the members' platforms.

Moderately Large Data Sets
     Programs with moderately large data sets for input or output, up to
     perhaps 100MB total, are fine.

Moderately Large Working Sets
     Programs with moderately large working sets, up to about 32MB, are fine.

Not CPU-Bound
     Programs that are important in some scientific/engineering application
     may be of interest, even if they are not bound by floating-point,
     integer, or memory bandwidth.

Unix Dependence
     Programs that depend in subtle ways on the fact that they're running on
     some kind of Unix derivative are fine.   I don't expect to port to
     mainframe, supercomputer, or PC operating systems that aren't a lot like
     Unix because my test structure won't handle them.   Programs that are
     dependent on a very specific Unix derivative aren't so fine.

X Graphics Programs
     Programs whose output is in the form of X graphics are useful if the
     graphic rendering portion significantly affects performance. More typical
     scientific programs with graphical output that mostly compute and then
     quickly display the computed result graphically are not worth the trouble
     of getting them to compile and link with X on various platforms.

SPEC Rejects
     SPEC has necessarily rejected a number of candidate programs for various
     reasons - lack of manpower to investigate all of them being an important
     one.  SPEC rejects are fine for me provided they weren't rejected for one
     of the reasons I'd reject them.

Mixed C and Fortran
     Mixed language programs are a pain - darn those underscores - but in many
     cases it makes a lot of sense to code the I/O in C and the number
     crunching in Fortran.
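
     A minimal sketch of the underscore problem, under stated assumptions:
     the Fortran compiler is assumed to map external names to lower case
     with one trailing underscore and to pass every argument by reference,
     as f2c and many Unix f77 compilers do.  The macro FORTRAN_NAME and the
     routine fillv are hypothetical names used only for illustration.

```c
/* Assumption: external Fortran names are lower case with one trailing
 * underscore.  Other compilers differ, so the mapping is isolated in a
 * single macro that can be changed per platform. */
#define FORTRAN_NAME(lower) lower##_

/* Hypothetical I/O helper written in C, callable from Fortran as
 *     CALL FILLV(N, X)
 * It fills X(1..N) with 1.0, 2.0, ...; a real helper would read a file.
 * Assumes Fortran INTEGER matches C int and DOUBLE PRECISION matches
 * C double, and that arguments arrive as pointers. */
void FORTRAN_NAME(fillv)(const int *n, double *x)
{
    int i;
    for (i = 0; i < *n; i++)
        x[i] = (double)(i + 1);
}
```

     Keeping the name mapping in one macro confines the platform-dependent
     part to a single line.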

Do All Directives
     The least common denominator explicit parallelization directive - "do the
     following loop in parallel unconditionally" - seems worth admitting even
     though the spelling varies among (mostly Fortran) compilers.

Restrictions

     Here, roughly in order of importance, are the restrictions on what I can
use effectively.   I realize that some of them tend to be contradictory.
Beggars can't be choosy, you say?  Perhaps, but for the moment I will hope for
the best.

Freely Distributable
     It doesn't do any good to publish correctness or performance evaluations
     if other people can't check your work to determine whether you know what
     you're talking about.  So I don't want to adopt any more programs that
     aren't freely distributable internationally.

Freely Modifiable
     I invariably have to modify programs somewhat to fit my test structure,
     and typically to standardize the time measurement and the uniform random
     number generation, if any.

Realistic Applications of Economic Significance
     Kernels are unacceptable, and simplified applications are suspect.   The
     reason is that people suppose or determine what the bottleneck in a
     particular application on a particular system is at a particular time,
     produce a drastically simplified benchmark program, and continue to run
     it for years.   Meanwhile technology changes, and the bottleneck in a
     particular application changes, and they wonder why the delivered system
     performance on the actual application doesn't match the expectations they
     developed benchmarking during the procurement.

     On the other hand, realistic applications are often very complex and
     messy, and simplifications in that respect are welcome as long as they
     don't remove any potential future bottlenecks.  "Economic Significance"
     means that somebody would save some significant resources if the test
     program ran faster - we would like hardware and software designers to
     optimize the most economically valuable things.

     One way to think of economic significance is this: for a program that
     runs enough each year to keep ten workstations busy full time, a 10%
     improvement in optimization would free up one workstation-year, which
     seems worthwhile ($10000) to me.    Correspondingly, for a program that
     keeps one supercomputer busy 10% of the time, a 1% optimization would
     free up 0.1% of a supercomputer-year, which might also be about $10000.
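
     The arithmetic can be sketched in C as follows.  The function names
     are hypothetical.  The $10000 cost of a workstation-year is the figure
     used above; a cost of $10,000,000 per supercomputer-year is only what
     the second example implies, and is treated here as an assumption.

```c
/* Machine-years freed per year by an optimization.
 *   machines:      number of machines the program could occupy
 *   busy_fraction: fraction of each machine's time the program uses
 *   improvement:   fractional reduction in run time (0.10 for 10%) */
double machine_years_freed(double machines, double busy_fraction,
                           double improvement)
{
    return machines * busy_fraction * improvement;
}

/* Dollar value of the freed time, given an assumed cost per
 * machine-year. */
double annual_savings(double years_freed, double cost_per_machine_year)
{
    return years_freed * cost_per_machine_year;
}
```

     With these assumptions, ten fully busy workstations and a 10%
     improvement give one workstation-year, and one supercomputer busy 10%
     of the time with a 1% improvement gives 0.001 supercomputer-year; both
     come to roughly $10000 per year.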

Portable Correct Fortran or C
     Aside from exploiting IEEE 754 features as outlined above, programs
     should be portable among modern PCs and workstations.  The principal
     correctness issue is avoiding violations of language restrictions that
     optimizers depend on but that can't be checked economically at compile
     time or run time.  The worst of these is aliasing parameters to Fortran
     subprograms.   Such violations are the devil to track down, and look
     just like optimizer bugs, because they usually work as expected
     unoptimized, and typically worked as expected optimized on all the
     releases previous to the one that broke.
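
     The hazard can be sketched in C rather than Fortran, assuming the
     C99 restrict qualifier, which postdates this memorandum and makes a
     promise similar to the Fortran rule for dummy arguments that are
     modified.  The function names add_first and add_own_first are
     hypothetical.

```c
/* Adds x[0] to every element of y.  The restrict qualifiers promise the
 * compiler that x and y do not overlap, so an optimizer may load x[0]
 * once before the loop instead of rereading it on each iteration. */
void add_first(int n, double *restrict y, const double *restrict x)
{
    for (int i = 0; i < n; i++)
        y[i] += x[0];
}

/* Legal use: the caller copies the scalar out first, so the two
 * arguments never refer to the same storage. */
void add_own_first(int n, double *a)
{
    double first = a[0];
    add_first(n, a, &first);
}

/* Illegal use, shown only as a comment: add_first(n, a, a).
 * Unoptimized code typically rereads a[0] after it has been doubled;
 * optimized code may use the old value.  Neither result is guaranteed. */
```

     The illegal call compiles without complaint, which is why such a
     violation tends to surface only when an optimizer changes.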

Self-Contained
     Programs that can completely link with the standard libraries are best.
     But I am used to dealing with timing and uniform random number
     generators.

Self-Checking
     To combat incorrect optimization, it's essential to be able to detect
     whether the output is correct or not.   The best kind of self-check is a
     mathematical residual, like ||b-Ax|| that the Linpack benchmarks compute,
     and another good kind for physical problems is that an invariant like
     energy or momentum is conserved.   The PERFECT programs test some of
     their computed results against various error bounds which presumably were
     derived from runs believed to be correct, but the different programs
     appear to vary in the diligence with which those bounds were derived.
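
     A minimal sketch of a residual self-check in C, assuming a small
     dense n-by-n matrix stored row by row.  The function name
     residual_inf_norm is hypothetical, the infinity norm is chosen only
     for simplicity, and any scaling a real benchmark applies before
     judging the residual is omitted.

```c
/* Returns max over i of |b[i] - sum_j a[i*n + j] * x[j]|, the infinity
 * norm of the residual b - Ax.  A NaN residual is returned as NaN so
 * that it cannot be mistaken for a small value. */
double residual_inf_norm(int n, const double *a, const double *x,
                         const double *b)
{
    double worst = 0.0;
    int i, j;
    for (i = 0; i < n; i++) {
        double r = b[i];
        for (j = 0; j < n; j++)
            r -= a[i * n + j] * x[j];
        if (r != r)             /* NaN: report it immediately */
            return r;
        if (r < 0.0)
            r = -r;
        if (r > worst)
            worst = r;
    }
    return worst;
}
```

     The caller still has to choose a pass/fail threshold appropriate to
     the scale of the problem; the sketch only computes the norm.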

Stable Against Roundoff
     If no self-check is obvious, at least the computed results should be
     fairly stable against differences in expression evaluation and elementary
     transcendental functions, so that correctness of optimized output, for
     instance, can be judged by looking at unoptimized output.
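
     One way to judge optimized output against unoptimized output is a
     comparison with tolerances, sketched below in C.  The function names
     and the choice of a combined relative and absolute tolerance are
     assumptions for illustration; suitable tolerance values depend on the
     program.

```c
/* Returns 1 if got agrees with ref to within abs_tol, or to within
 * rel_tol relative to |ref|, else 0.  Assumes ref is finite; a NaN in
 * either value never agrees. */
int agrees(double ref, double got, double rel_tol, double abs_tol)
{
    double diff = ref - got;
    double scale = ref < 0.0 ? -ref : ref;
    if (diff < 0.0)
        diff = -diff;
    return diff <= abs_tol || diff <= rel_tol * scale;
}

/* Counts entries of got[0..n-1] that disagree with ref[0..n-1]. */
int count_mismatches(int n, const double *ref, const double *got,
                     double rel_tol, double abs_tol)
{
    int i, bad = 0;
    for (i = 0; i < n; i++)
        if (!agrees(ref[i], got[i], rel_tol, abs_tol))
            bad++;
    return bad;
}
```

     A program whose results are stable against roundoff should give a
     mismatch count of zero at a tolerance well above machine precision.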

Run for 1-10 Minutes on Current Workstations
     Programs that run quickly may not be timed very accurately on a
     multi-user operating system.   Programs that run a long time are
     acceptable, but programs that run a moderate amount of time and have the
     same performance characteristics are even more acceptable.

Batch, not Interactive
     My test harness runs for days.   Input has to be from fixed data files.

Orthogonal to other Programs I use
     Programs whose performance correlates closely, over a variety of
     platforms, with a program in SPEC or PERFECT or with one of the other
     programs I use contribute little useful information.

Multiple Input Data Sets
     Programs with no input data are suspect for overzealous optimization.
     Complicated programs like SPICE should have a variety of input data sets
     that exercise different features.

Main Input and Output in ASCII
     Programs may have other input, scratch, and output files, but the input
     file specifying the computation parameters and the output file containing
     the information about whether the results were correct should be in
     ASCII.

Current Algorithmic Technology
     Programs that solve linear least squares problems by explicitly forming
     the normal equations and then explicitly forming the matrix inverse do
     not represent current technology.   I'd prefer something more competent,
     but this has to be balanced against the economic significance reality
     that people pay for making their existing programs run faster with as
     little change as possible.

One per Person
     Since I have limited bandwidth for processing contributions, I'd
     appreciate it if interested persons nominate their one best candidate;
     I'll ask for more when I finish with those.

Programs currently used

     In my "Searching for a Solaris Workstation" report, I used the following
programs to test correctness and performance:

spiff
f2c
SPECint92
SPECfp92
PERFECT
spice3e2

SPEC, PERFECT, and SPICE are not freely distributable, but they are more or
less readily obtainable.   spiff and f2c are freely distributable.

     Current candidates for addition to the suite, pending confirmation of
free distributability:

barroln
     Fortran three-dimensional steady state deformation in bar rolling, by K.
     Mori of Kyoto Institute of Technology.

cslalom
     C translation of optical radiosity benchmark by Gustafson et al., Ames.

goliath
     Fortran Exact Rational System Analyzer by Alfeld and Eyre of Utah.

herwig
     Fortran Monte Carlo simulation of Hadron/Gluon interactions, from Parma
     and Cambridge.

hydro
     Fortran two-dimensional Lagrangian hydrodynamics, from Los Alamos.

isajet
     Fortran Monte Carlo simulation of particle interactions, by Paige and
     Protopopescu, Brookhaven.

jetset
     Fortran Monte Carlo jet fragmentation, by Sjostrand and Bengtsson from
     Lund.

rayshade 406
     C ray tracing program, by Kolb, Bogart et al of Princeton.

reweight
     Fortran Monte Carlo simulation of particles through a detector, by
     Seymour of Washington University.

sim
     C comparison of two sequences of DNA, by Huang and Miller of Penn State.

tc8
     Fortran mathematical group theory, from University of Sydney, NSW.

wondy
     Fortran solid dynamics simulation, from Sandia.

1000x1000 linpack
     Fortran based on LAPACK and [CDSZ]GEMM BLAS.

thesis
     If I had a way to extract the source code for my thesis work from a
     CDC-6400 tape made in 1977, I might make a test program out of that -
     Fortran searching for minima of a complicated non-analytic complex
     function, corresponding to the nearest polynomial with a double zero.

     Several of these candidates deviate in some respect from the ideals
listed above, and so might be superseded.   Thus the 1000x1000 linpack is an
exception to the rule against kernels, and subject to deletion if a suitable
substitute comes along, and optimizing my thesis research programs is of
questionable economic utility.