Proposals for SPEC 3.0

David G. Hough on validgh dgh
Wed Jun 19 20:46:23 PDT 1991


[
At Sun, formatted copies of this memorandum may be obtained by
        tbl ~dgh/memo/spec3 | troff -ms
Comments are welcome prior to more general circulation
]

     When the idea that became SPEC first started circulating I was among the
many that agreed that it would be good if somebody, somewhere did all the work
necessary to establish an industry standard performance test suite to super-
sede *h*stone and the many linpacks, which had outlived their usefulness in an
era of rapid technological change.

     Fortunately a few somebodies somewhere did get together and do the work,
and SPEC 1.0 has been a tremendous success in re-orienting end users toward
realistic expectations about computer system performance on realistic applica-
tions, and in re-orienting hardware and software designers toward optimizing
performance for realistic applications.

     In that spirit I'd like to suggest some changes for consideration in SPEC
3.0, the second generation of compute-intensive benchmarks.  Many of the
suggestions come from study of the Perfect Club benchmarks and procedures,
which are more narrowly focused than SPEC, primarily on scientific Fortran
programs.

Reporting Rules

     In addition to the mandatory standard SPEC results, which allow changes
to source code solely for the sake of portability, SPEC should also permit and
encourage optional publication of tuned SPEC results in which applications may
be rewritten for better performance on specific systems.  In the spirit of
SPEC, publication of tuned results must be accompanied by listings of the
differences between the tuned source code and the standard source code.

     These two types of results - on portable programs and on specifically
tuned programs - correspond to two important classes of end users.   Most
numerous are those who, for many reasons, can't or won't rewrite programs.
Their needs are best represented by SPEC results on standard portable source
code. More influential in the long run, but far fewer in numbers, are
leading-edge users who will take any steps necessary to get the performance
they require, including rewriting software for specific platforms.

     Arguing the legitimacy of rewrites by system vendors would be a black
hole for the SPEC organization.  Allowing rewrites under public scrutiny
leaves the decision about appropriateness to the leading-edge end users who
would have to make such a determination anyway.  Requiring tuned SPEC results
to always be accompanied by standard SPEC results and by the corresponding
source code diffs reminds the majority of end users of the cost required to
get maximum performance on their own applications.

General Content

     SPEC 3.0 benchmarks should time important realistic applications, as com-
plete as portability permits, from whose performance users may reasonably pro-
ject performance of their similar applications.

     It's also important to verify the correctness of the computed results, to
avoid the possibility of astounding performance achieved while computing
erroneous results.
This correctness verification is ideally an independent step that is not timed
as part of the benchmark.  Somewhat in contradiction, it is highly desirable
that the correctness be in terms meaningful to the application: thus in the
linpack benchmark, correctness is determined by computing a normalized resi-
dual ||b-Ax|| rather than by printing out columns of x and hoping to devise a
test that discriminates insignificant differences from significant ones.  For
physical simulations, appropriate tests of correctness include checks that
physically conserved quantities such as momentum and energy are conserved com-
putationally.
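
     For concreteness, here is a small illustrative sketch of such a check in
Python with numpy; the matrix sizes and data are invented.  The solve is the
part that would be timed, and the normalized residual is an independent,
untimed verification whose value should be of order 1.

    import numpy as np

    def normalized_residual(A, x, b):
        # linpack-style figure of merit: ||b - A x|| scaled by n, machine
        # epsilon, ||A||, and ||x||; values of order 1 indicate a correct
        # solve.
        n = A.shape[0]
        eps = np.finfo(A.dtype).eps
        r = np.linalg.norm(b - A @ x, np.inf)
        return r / (n * eps * np.linalg.norm(A, np.inf)
                    * np.linalg.norm(x, np.inf))

    # The solve is the timed portion; the residual check is a separate,
    # untimed verification step.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((300, 300))
    b = rng.standard_normal(300)
    x = np.linalg.solve(A, b)
    print(normalized_residual(A, x, b))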

     Benchmarks should be orthogonal in this statistical sense:  it should not
be possible statistically to predict the performance of one SPEC benchmark
with any accuracy across many SPEC member platforms from the known performance
of a disjoint subset of the other SPEC benchmarks on those platforms.
As long as important realistic applications are chosen that can be reasonably
verified for correctness, and this orthogonality criterion is satisfied, I see
no need to arbitrarily limit the number of SPEC computational benchmarks.
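
     One illustrative way to check the criterion - a sketch only; the
redundancy_r2 function and the synthetic data below are hypothetical, not part
of any SPEC procedure - is to try to predict a candidate benchmark's
log-SPECratios across platforms from the existing benchmarks' log-SPECratios
and examine the cross-validated R-squared; values near 1 would mark the
candidate as statistically redundant.

    import numpy as np

    def redundancy_r2(existing, candidate):
        # Leave-one-platform-out prediction of a candidate benchmark's
        # log-SPECratios from those of the existing benchmarks; an R^2
        # near 1 suggests the candidate adds little new information.
        X = np.log(np.asarray(existing, dtype=float))   # platforms x tests
        y = np.log(np.asarray(candidate, dtype=float))  # one per platform
        X = np.column_stack([np.ones(len(y)), X])       # intercept column
        pred = np.empty_like(y)
        for i in range(len(y)):
            keep = np.arange(len(y)) != i
            coef = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
            pred[i] = X[i] @ coef
        return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

    # Synthetic data: 12 platforms, 4 existing benchmarks, and a candidate
    # that is nearly a blend of two of them, hence largely redundant.
    rng = np.random.default_rng(1)
    existing = np.exp(rng.normal(2.5, 0.5, size=(12, 4)))
    candidate = (existing[:, 0] ** 0.6 * existing[:, 1] ** 0.4
                 * np.exp(rng.normal(0.0, 0.05, size=12)))
    print(redundancy_r2(existing, candidate))    # near 1 flags redundancy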

     In addition to the current SPEC 3.0 candidates, I recommend to SPEC for
future consideration the Fortran programs collected by the PERFECT Club.
Aside from spice2g6, which has a different input deck, they seem to be pretty
much orthogonal to current SPEC 1.0 members.

     A number of the gcc and espresso subtests run too fast to be timed accu-
rately.  They should be replaced by more substantial ones.

Specific Comments - matrix300

     The matrix300 benchmark has outlived its usefulness.  Like linpack before
it, it demonstrates susceptibility to optimization seldom observed in realis-
tic applications.  Amazing performance improvements have been reported by
applying modern compiler technology previously reserved for vectorizing super-
computers:

                   System    Old SPECratio   New SPECratio

                   IBM 550        104             730
                   HP 730          36             510

Competitive compilers should indeed exploit such technology, but it does end
users no good to suggest that many realistic applications will subsequently
show 7X-15X performance improvements.  While the matrix multiplication portion
of such an application can and should demonstrate significant improvements,
the overall application's improvement will be tempered by the portions that
aren't susceptible to such optimizations.  The spirit of SPEC is much better
served by seeking and incorporating realistic applications for which matrix
multiplication is an important component among others.

Specific Comments - nasa7

     Nasa7 consists of the kernels of seven different important computational
applications.   As such it was much more realistic - because its kernels were
more realistically complicated - than the Livermore Loops, which it has largely
supplanted.  Each of the specific types of applications - involving matrix
multiplication, 2D complex FFT, linear equations solved by Cholesky, block
tridiagonal, complex Gaussian elimination methods, etc. - should be
represented separately by realistic applications rather than somewhat arbi-
trarily lumped into one benchmark.

Specific Comments - doduc

     There is one troubling aspect of the doduc program: it lacks any good
test of correctness other than the number of iterations required to complete
the program, which number might not be a very reliable guide.  For instance,
if the simulated time is extended from 50 to 100 seconds, the number of itera-
tions appears to vary by 20% (20,000 - 24,000) among systems which appear to
behave similarly in shorter runs.  A better correctness criterion should be
devised if doduc is retained in SPEC 3.0.

Specific Comments - spice

     The greycode input deck doesn't seem to correspond to any very common
realistic computations, and it takes a long time to run as well.  Several
other input decks have been proposed; SPEC should include a number of them as
spice2g6 subtests.

     In addition I urge SPEC to consider the spice3 program from UCB.  It is
unusual - a publicly available, substantial scientific computation program,
written in C.  It accepts most of the input decks that spice2g6 accepts.

Specific Comments - gcc

     gcc 1.35 represents relatively old compiler technology.  gcc 2.0 is
designed to do the kinds of optimizations required for the kinds of hardware
platforms that sell on SPECmarks.  I encourage SPEC to replace gcc 1.35 with
2.0 as soon as the latter is available for distribution.

     In addition I urge SPEC to consider the f2c Fortran-to-C translator from
AT&T.  It is another publicly available, substantial program written in C, with
many of the same kinds of analyses that a full Fortran compiler performs.

SPECmark Computations

     The SPECmark, SPECint, and SPECfp computations are based on geometric
means of execution times of realistic applications.  That's the correct
approach.  I'd base it on elapsed real time rather than user+system time,
since these may differ in cases like gcc in which some I/O is going on.  I'd
also make the baseline a currently widely available system such as a SPARCsta-
tion 2 rather than the rapidly disappearing VAX 780.  Thus I'd define "21
SPEC3.0marks" to be the best performance obtained on a SPARCstation 2,
although any other common RISC Unix workstation would serve about as well.
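
     To illustrate the arithmetic with invented elapsed-time figures (not
measurements of any real system), a SPECratio here is the reference machine's
elapsed real time divided by the test machine's, and the composite mark is the
geometric mean of those ratios, scaled so that the reference machine itself
scores 21:

    import math

    # Invented elapsed real times in seconds; not measurements.
    reference_times = {"gcc": 1200.0, "espresso": 900.0, "doduc": 1500.0}
    test_times      = {"gcc":  400.0, "espresso": 450.0, "doduc":  300.0}

    # SPECratio: reference elapsed time divided by test elapsed time.
    ratios = [reference_times[b] / test_times[b] for b in reference_times]

    # Composite mark: geometric mean of the ratios, scaled so that the
    # reference machine itself (all ratios 1.0) scores 21 SPEC3.0marks.
    mark = 21.0 * math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    print(round(mark, 1))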

     I would handle the cases of multiple subtests somewhat differently than
SPEC 1.0 does: for gcc and espresso, and perhaps spice2g6 and spice3 in the
future.  Currently the run times of subtests are added up to get an overall
execution time.  For the same reason that the geometric mean of several tests
is appropriate for the overall SPECmark, the geometric mean of the SPECratios
of the subtests is the appropriate SPECratio for that test.  Thus instead of
adding up the times for all the 8 espresso inputs and comparing those to the
sum of the 8 times on the reference system, I'd compute the SPECratio for each
espresso input, compute the geometric mean of those 8 SPECratios, and use that
as the SPECratio for the espresso benchmark when computing the overall SPEC-
mark.
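
     A small sketch with hypothetical subtest times shows the difference
between the two schemes: the ratio of summed times lets the one long-running
input dominate, while the geometric mean of per-subtest SPECratios weights
every input equally.

    import math

    # Invented elapsed times in seconds for four espresso inputs on the
    # reference machine and on a machine under test.
    ref_times  = [30.0, 5.0, 120.0, 8.0]
    test_times = [10.0, 4.0,  30.0, 6.0]

    # SPEC 1.0 practice: compare the summed subtest times, which lets the
    # one long-running input dominate the result.
    ratio_of_sums = sum(ref_times) / sum(test_times)            # about 3.3

    # Proposed practice: one SPECratio per subtest, combined by geometric
    # mean, so that every input counts equally.
    subtest_ratios = [r / t for r, t in zip(ref_times, test_times)]
    benchmark_ratio = math.exp(sum(math.log(x) for x in subtest_ratios)
                               / len(subtest_ratios))           # about 2.1
    print(ratio_of_sums, benchmark_ratio)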

     SPECint and SPECfp are currently recognized as subsets of SPECmark.
Should single-precision floating-point results be treated separately from
double-precision?  How should SPEC be reported on systems whose "single
precision" is 64-bit rather than the 32-bit format common on workstations and
PCs?
Although 64-bit computation is most common as a safeguard against roundoff,
many important computations are routinely performed in 32-bit single precision
with satisfactory results.

     To be meaningful to end users, SPEC source codes would ideally allow
easily changing the precision of variables, and vendors would be allowed and
encouraged to treat working precision like a compiler option, using the best
performance that yields correct results - of course documenting those choices.
I know from experience, however, the tedium of adapting source codes to be so
flexible; and such flexibility also requires greater care in testing correct-
ness of results.
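
     As a sketch of what such flexibility might look like - the kernel, sizes,
and data below are invented - the working precision becomes a parameter of the
source, and the correctness test is re-evaluated at whatever precision is
actually reported:

    import numpy as np

    def kernel_residual(dtype):
        # Toy kernel whose working precision is a parameter: solve a linear
        # system in the requested precision and return the same style of
        # normalized residual used earlier as the correctness test.
        rng = np.random.default_rng(0)
        n = 200
        A = rng.standard_normal((n, n)).astype(dtype)
        b = rng.standard_normal(n).astype(dtype)
        x = np.linalg.solve(A, b)
        eps = np.finfo(dtype).eps
        return (np.linalg.norm(b - A @ x, np.inf)
                / (n * eps * np.linalg.norm(A, np.inf)
                   * np.linalg.norm(x, np.inf)))

    # Report whichever precision runs fastest, provided its residual still
    # passes, and document the choice alongside the result.
    for dtype in (np.float32, np.float64):
        print(np.dtype(dtype).name, kernel_residual(dtype))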

Why Geometric Mean is Best for SPEC

     The progress of some end users is limited by the time it takes a fixed
series of computational tasks to complete.  They then think about the results
and decide what to do next.  The appropriate metric for them is the total
elapsed time for the applications to complete, so the arithmetic mean of times
is the appropriate summary statistic.  If rates, the inverse of times, happen
to be available instead, the appropriate statistic is the harmonic mean of
rates.  If application A runs ten times as long as application B, then a 2X
improvement in application A is ten times as important as a 2X improvement in
application B.
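
     A toy calculation with invented times makes the equivalence explicit: the
arithmetic mean of the times and the harmonic mean of the corresponding rates
carry exactly the same information, and an improvement to the long-running
application is worth ten times as much.

    # Invented elapsed times in hours: application A runs ten times as
    # long as application B.
    times = [10.0, 1.0]
    rates = [1.0 / t for t in times]              # runs per hour

    mean_time = sum(times) / len(times)                            # 5.5
    harmonic_mean_rate = len(rates) / sum(1.0 / r for r in rates)  # 1/5.5

    # The two summaries carry exactly the same information.
    assert abs(harmonic_mean_rate - 1.0 / mean_time) < 1e-12

    # Halving A saves 5.0 hours per pass through the workload; halving B
    # saves only 0.5 hours.
    print(sum(times) - sum([5.0, 1.0]), sum(times) - sum([10.0, 0.5]))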

     Other computational situations are characterized by a continual backlog
of processes awaiting execution.  If the backlog were ever extinguished, the
problem sizes would simply grow - grid densities would be doubled, say - and
saturation would again result.  In these
cases the appropriate metric is rates - computations per time - and the
appropriate summary statistic is an arithmetic mean of rates, or if times are
available, a harmonic mean of times.  A 2X improvement in application A is
just as important as a 2X improvement in application B.

     What about the commonest case consisting of workloads of both sorts?
With geometric means of SPECratios, the conclusions are the same whether rates
or times are used, and rate data and time data may readily be combined.
That's why I like to use the geometric mean to combine SPECratios of diverse
types of programs.
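
     The following sketch, again with invented numbers, shows both effects at
once: arithmetic means of times and of rates can rank two systems oppositely,
while geometric means reach the same verdict from either representation,
because the geometric mean of rates is exactly the reciprocal of the geometric
mean of times.

    import math

    def amean(xs):
        return sum(xs) / len(xs)

    def gmean(xs):
        return math.exp(sum(math.log(x) for x in xs) / len(xs))

    # Invented elapsed times in seconds for two benchmarks on systems X
    # and Y, chosen to make the point.
    times_X = [1.0, 10.0]
    times_Y = [2.0,  6.0]
    rates_X = [1.0 / t for t in times_X]
    rates_Y = [1.0 / t for t in times_Y]

    # Arithmetic mean of times favors Y; arithmetic mean of rates favors X.
    print(amean(times_X), amean(times_Y))     # 5.5  vs 4.0
    print(amean(rates_X), amean(rates_Y))     # 0.55 vs about 0.33

    # Geometric means give the same verdict either way, since the geometric
    # mean of rates is the reciprocal of the geometric mean of times.
    print(gmean(times_X), gmean(times_Y))     # about 3.2  vs 3.5
    print(gmean(rates_X), gmean(rates_Y))     # about 0.32 vs 0.29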

     As with benchmarks themselves, the most appropriate way to combine bench-
mark results varies among end users according to their situation.  SPEC has
wisely chosen the most neutral way to combine results - while requiring that
individual results be available as well.

A Measure of Variance for SPEC?

     It is a feature of modern high-performance computing systems that their
relative performance varies tremendously across different types of applica-
tions.  It is therefore inevitable, rather than a defect in SPEC, that a sin-
gle performance figure has so little predictive power.

     SPEC requires that SPECmark results be accompanied by SPECratios for each
test.  This is a reasonable requirement, but it is not realistic to expect end
users to absorb 30 or more performance numbers for every system.  Some addi-
tional simple means of representing variance is warranted.

     I propose that every SPECmean computed by geometric mean of SPECratios be
accompanied by a ± tolerance representing the variance in the set of
SPECratios used to compute the geometric mean.  Inasmuch as the geometric mean
of SPECratios is the exponential of the arithmetic mean of the logs of the
SPECratios, I propose that the tolerance be computed from the standard
deviation s of the logs of the SPECratios in this way:

        u = mean(log(SPECratios))
        s = standard deviation(log(SPECratios))
        U = exp(u)
        S = U*(exp(2*s) - 1)
        round U to nearest two significant figures
        round S upward to same number of decimal places as U
        SPECmean = U ± S
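
     A direct transcription of the procedure - a sketch; the SPECratios below
are invented, the population form of the standard deviation is assumed, and
the final rounding steps are omitted - might read:

    import math

    def specmean(specratios):
        # Geometric mean of the SPECratios plus the proposed tolerance,
        # computed from the standard deviation of their logarithms.
        logs = [math.log(r) for r in specratios]
        u = sum(logs) / len(logs)
        s = math.sqrt(sum((x - u) ** 2 for x in logs) / len(logs))
        U = math.exp(u)                       # the geometric mean itself
        S = U * (math.exp(2.0 * s) - 1.0)     # the +- tolerance
        return U, S

    # Invented SPECratios, purely to exercise the formula.
    U, S = specmean([14.0, 18.0, 21.0, 22.0, 25.0, 30.0])
    print("SPECmean = %.0f +- %.0f" % (U, S))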

Thus I would summarize the results of some recent experimental compiler tests
as

        SPECmark = 20 ± 2
        SPECint  = 19 ± 2
        SPECfp   = 21 ± 1

Strictly speaking statistically, the SPECmean would be

        exp(u) + exp(u)*(exp(2*s)-1)
               - exp(u)*(1-exp(-2*s))

but simplicity recommends the earlier formulation.



