SPEC 3 recommendations - third version
David Hough
sun!Eng!David.Hough
Thu Jun 27 16:26:47 PDT 1991
This version incorporates suggestions received after my usenet posting.
What should be changed in SPEC 3.0 besides adding new benchmark
programs?
Sun employees may format by
tbl ~dgh/memo/spec3 | troff -ms
I will email troff source to others on request to dgh@eng.sun.com.
When the idea that became SPEC first started circulating, I was among the
many who agreed that it would be good if somebody, somewhere did all the work
necessary to establish an industry standard performance test suite to super-
sede *h*stone and the many linpacks, which had outlived their usefulness in an
era of rapid technological change.
Fortunately a few somebodies somewhere did get together and do the work,
and the SPEC 1.0 benchmark suite has been a tremendous success in re-orienting
end users toward realistic expectations about computer system performance on
realistic applications, and in re-orienting hardware and software designers
toward optimizing performance for realistic applications.
Why does SPEC need to establish a new, second generation compute-
intensive benchmark suite, SPEC 3.0, just as 1.0 is getting well established?
Because the computer business is an extremely dynamic one, and performance
measurement techniques have lifetimes little better than the products they
measure - a year or two! See the discussion of matrix300 below.
Certain desirable changes for SPEC 3.0 and subsequent versions will keep
them oriented to the needs of end users. Some of the suggestions come from
study of the Perfect Club benchmarks and procedures, which are more narrowly
focused than SPEC, primarily on supercomputer scientific Fortran programs.
Some suggestions below entail substantial technical effort to realize.
Readers should be aware that SPEC's own resources are minimal and fully util-
ized in porting and reporting. SPEC does not develop benchmark programs, nor
can it engage in massive rewrites, and the system vendor employees who do
SPEC's porting and reporting can't be expert in all technical application
areas. The best way to improve the quality of the SPEC suite is for end-user
organizations to support their employees' efforts to identify good benchmark
candidates for SPEC and undertake the necessary work to put them in portable,
distributable form with appropriate tests of correctness.
SPEC's goals might be better advanced in the long term if SPEC test
suites and machine-readable newsletter contents were freely available via a
netlib mechanism, with SPEC licensing the right to advertise various
SPECstats, rather than SPEC licensing the right to use the source code. This
is particularly important if system vendors are allowed to tune source code,
as proposed below, since tuned source code may change relatively frequently.
However any such change in distribution might entail a corresponding change in
SPEC's funding mechanism!
What's a Floating-Point Application?
Defining "floating-point-intensive application" and "integer application"
is an interesting problem in itself. If floating-point operations constitute
less than 1% of the total dynamic instruction count on all platforms that
bother to measure, that's surely an integer application. If floating-point
operations constitute more than 10% of the total dynamic instruction count on
all platforms that bother to measure, that's surely a floating-point applica-
tion. Intermediate cases may represent important application areas; these
should not be included in SPEC 3.0, however, unless at least three can be iden-
tified. spice running the greycode input could be the first. Should these
mixed cases be included in SPECint, or SPECfp, or form a third subcategory
SPECmixed?
Instruction counts can be tricky, especially on vector machines. In
practice I have found the following to be useful: compare the Sun-3 run times
for a program compiled with -ffpa and -fsoft. If fsoft/ffpa > 4 then the pro-
gram is floating-point. If fsoft/ffpa < 2 then the program is integer. A
similar approach is to compare times on a Sun-4 with hardware floating point
enabled and disabled; the critical ratios are more like 100 and 10.
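Purely for illustration, a minimal C sketch of this rule of thumb follows; the
thresholds are the Sun-3 -fsoft/-ffpa values quoted above, and the example
timings are invented:

    /* Classify a program as integer, floating-point, or mixed from the
       ratio of its run time with software floating point (t_soft) to its
       run time with hardware floating point (t_hard).  Thresholds are the
       Sun-3 rule of thumb above; a Sun-4 would use roughly 100 and 10. */
    #include <stdio.h>

    static const char *classify(double t_soft, double t_hard)
    {
        double ratio = t_soft / t_hard;
        if (ratio > 4.0)
            return "floating-point";
        if (ratio < 2.0)
            return "integer";
        return "mixed";    /* a candidate for a SPECmixed subcategory */
    }

    int main(void)
    {
        printf("%s\n", classify(830.0, 95.0));    /* invented times */
        return 0;
    }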
Floating-point Precision
SPECint and SPECfp are currently recognized as subsets of SPECmark.
Should single-precision floating-point results be treated separately from
double-precision? How should SPECfp be reported on systems whose "single
precision" is 64-bit rather than the 32-bit common on workstations and
PCs? Although 64-bit computation is most common as a safeguard against
roundoff, many important computations are routinely performed in 32-bit single
precision with satisfactory results. To bypass these issues SPEC Fortran
source programs declare floating-point variables as either "real*4" or
"real*8" - no "real" or "doubleprecision".
To be meaningful to end users, SPEC source codes would ideally allow
easily changing the precision of variables, and vendors would be allowed and
encouraged to treat working precision like a compiler option, using the best
performance that yields correct results - of course documenting those choices.
I know from experience, however, the great tedium of adapting source codes to
be so flexible; and such flexibility also requires greater care in testing
correctness of results. Furthermore, whether such flexibility is permissible
has to be determined for each benchmark individually, as discussed next.
Results from Tuned Source Code
In addition to the mandatory standard SPEC results which permit changes
to source code solely to permit portability, SPEC should also permit optional
publication of tuned SPEC results in which applications may be rewritten for
better performance on specific systems. In the spirit of SPEC, publication of
tuned results must be accompanied by listings of the differences between the
tuned source code and the portable source code. If these differences are so
massive as to discourage publication, perhaps that's a signal to the system
vendors that they've been unrealistic in tuning.
Over time tuning rules will be discovered for particular benchmark pro-
grams that are abbreviated in some way from realistic applications. Basi-
cally, a tuned source code for an abbreviated benchmark is illegal if the
corresponding tuning would render the original application invalid. An exam-
ple: if a long-running simulation is abbreviated to run for a fairly short
period of time to make a reasonable benchmark, using unstable algorithms or
computing in lower precision are unallowable tunings if they would produce
unacceptable results for the application, even if they produce acceptable
results for the abbreviated benchmark. Similar considerations apply to
"unsafe" compiler optimizations. The creators or users of the original appli-
cation will have to advise on whether particular tunings are unrealistic.
These two types of results - on portable programs and on specifically
tuned programs - correspond to two important classes of end users. Most
numerous are those who, for many reasons, can't or won't rewrite programs.
Their needs are best represented by SPEC results on standard portable source
code. More influential in the long run, but far fewer in numbers, are
leading-edge users who will take any steps necessary to get the performance
they require, including rewriting software for specific platforms. Supercom-
puter users are often in this class, as are former supercomputer users who
have migrated to high-performance workstations.
SPEC previously allowed publication of results for source codes enhanced
for performance. This was a mistake because it was not accompanied by all the
specific source code changes! All confirmed SPEC results must be reproducible
by unassisted independent observers from published source codes and Makefiles
and commercially available hardware and software.
Arguing the legitimacy of rewrites by system vendors would be a black
hole for the SPEC organization. Allowing rewrites under public scrutiny
leaves the decision about appropriateness to the leading-edge end users who
would have to make such a determination anyway. Requiring tuned SPEC results
to always be accompanied by portable SPEC results and by the corresponding
source code diffs reminds the majority of end users of the cost required to
get maximum performance on specific platforms.
Reporting Results
Just as tuned SPECstats should never be confused with portable SPECstats,
projected SPECstats for unreleased hardware or software products should never
be confused with confirmed SPECstats. A confirmed SPECstat is one that can be
reproduced by anybody because the benchmark sources and Makefiles are avail-
able from SPEC, and the hardware and software are publicly available. If pro-
jected SPECstats are to be permitted at all, they must be unambiguously iden-
tified as such, they must be complete with all projected SPECratios, and they
must be accompanied by an anticipated date when they will be confirmed.
Nor should SPECstats computed from SPEC 3.0 be confused with those com-
puted from SPEC 1.0. All SPECstats should be qualified with an identification
of the SPEC suite used to compute them. The calendar year of publication is
easiest to remember. Thus integer performance results derived from SPEC 3.0
benchmarks published in 1992 should be identified:
SPECint.92                    confirmed from portable source
SPECint.92.projected          projected from portable source
SPECint.92.tuned              confirmed from tuned source, diffs attached
SPECint.92.tuned.projected    projected from tuned source, diffs attached
Similarly for SPECfp.
I suspect the overall SPECmark has outlived its usefulness - there is no
reason to expect SPECint and SPECfp to be closely correlated in general. Oth-
erwise there would be no need to measure both. The workstation market - the
primary focus of SPEC members - is a complicated one. Some customers need
integer performance, some need floating point, many need both - in contrast to
the supercomputer market, which is mostly driven by floating-point perfor-
mance, and the PC market, driven mostly by integer performance.
All SPECstats.92 should include an indication of the dispersion of the
underlying set of SPECratios used to compute the SPECstat.92 geometric mean.
It is a feature of modern high-performance computing systems that their rela-
tive performance varies tremendously across different types of applications.
It is therefore inevitable, rather than a defect in SPEC, that a single per-
formance figure has so little predictive power. This means a single number
should never be cited as a SPECstat.92. If any circumstance warrants publish-
ing just one SPECmark, let it be the worst SPECratio of all the programs!
SPEC requires that SPECmark results be accompanied by SPECratios for each
test. This is an important requirement, but it is not realistic to expect
every consumer of SPEC results to absorb 30 or more performance numbers for
every system under consideration. Some additional means of representing
dispersion is warranted. A simple method is to quote the 12th and 88th
percentile SPECratios as a range:
SPECint.92 = 15..21
means that at least 3/4 of the SPECratios are in the 15..21 range, and the
worst eighth of the SPECratios are < 15 and the best eighth > 21. In this form
the quoted results are independent of the best and worst ratios, which I con-
sider undesirable. I prefer to express the range as the SPECmean +/- a
tolerance:
SPECint.92 = 19 +/- 4
means that the SPECmean is 19 and the interval 15..23 contains at least 3/4 of
the SPECratios.
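A sketch of how such a summary might be computed, in C; the SPECratios here
are invented, and the indexing simply drops the worst and best eighth of the
sorted ratios as described above:

    /* Summarize a set of SPECratios as "SPECmean +/- tolerance", where the
       interval covers at least 3/4 of the ratios: sort, drop the worst and
       best eighth, and take the tolerance as the larger distance from the
       geometric mean to the remaining extremes.  Illustrative only. */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    int main(void)
    {
        double r[] = { 14.2, 15.8, 16.9, 18.0, 19.5, 20.3, 21.7, 23.4 };
        int n = sizeof r / sizeof r[0], i, drop = n / 8;
        double logsum = 0.0;

        qsort(r, n, sizeof r[0], cmp);
        for (i = 0; i < n; i++)
            logsum += log(r[i]);

        double specmean = exp(logsum / n);          /* geometric mean */
        double lo = r[drop], hi = r[n - 1 - drop];  /* ~12th, ~88th percentiles */
        double tol = fmax(specmean - lo, hi - specmean);

        printf("SPECint.92 = %.2g +/- %.2g  (interval %g..%g)\n",
               specmean, tol, lo, hi);
        return 0;
    }

The %.2g formats round the mean to two significant figures, though they round
to nearest rather than implementing the directional rounding rules proposed
below.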
Nobody should make decisions based on insignificant differences in
SPECstats, and the rounding rules should emphasize that:
means rounded to nearest to two significant figures
tolerances rounded up to same number of figures after point as means
left interval endpoints rounded down to two significant figures
right interval endpoints rounded up to two significant figures
Either presentation, 19 +/- 4 or 15..21, emphasizes the futility of buying
decisions based on SPECmean differences in the third significant figure.
Ranges could be computed based upon standard deviations instead of per-
centiles. In any sample, at least 3/4 of the data is within two standard
deviations of the arithmetic mean. Since the geometric mean of SPECratios is
the exponential of the arithmetic mean of the logs of SPECratios, the toler-
ance can be computed from the standard deviation of the logs of the SPECratios
in this way:
u = mean(log(SPECratios))
s = standard deviation(log(SPECratios))
U = exp(u)
T = U*(exp(2*s) - 1)
SPECmean = U +/- T
Or standard deviations can be used to generate lower and upper bounds:
Lower bound = exp(u - 2*s)
Upper bound = exp(u + 2*s)
But both these standard-deviation methods are much more work computationally
and ultimately shed no more light than percentile methods.
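A minimal C sketch of the standard-deviation alternative just described, using
the same invented SPECratios; U and T follow the formulas above:

    /* Compute U = exp(mean(log r)) and T = U*(exp(2*s) - 1), where s is
       the standard deviation of log(SPECratio).  The interval [U-T, U+T]
       contains [exp(u-2s), exp(u+2s)], hence at least 3/4 of the ratios
       by Chebyshev's inequality. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double r[] = { 14.2, 15.8, 16.9, 18.0, 19.5, 20.3, 21.7, 23.4 };
        int n = sizeof r / sizeof r[0], i;
        double u = 0.0, s = 0.0;

        for (i = 0; i < n; i++)
            u += log(r[i]);
        u /= n;
        for (i = 0; i < n; i++)
            s += (log(r[i]) - u) * (log(r[i]) - u);
        s = sqrt(s / n);

        double U = exp(u);                    /* SPECmean */
        double T = U * (exp(2.0 * s) - 1.0);  /* tolerance */

        printf("SPECmean = %.3g +/- %.3g\n", U, T);
        printf("bounds: %.3g .. %.3g\n", exp(u - 2.0 * s), exp(u + 2.0 * s));
        return 0;
    }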
Reporting Make Times
Many workstation users are software developers who spend a lot of time in
the edit-compile-debug cycle and so are interested in compile times as well as
execution times. I suggest that SPEC not require reporting compile times, but
specify a format, to facilitate comparison, if they are reported.
Thus I'd define SPECint.make as the elapsed real time to make the com-
plete SPECint suite minus the times allocated to execution of the SPECint pro-
grams themselves, so that SPECint.make reports total make, compile, link,
results verification, and miscellaneous overhead. Unlike the SPEC benchmarks
themselves, make/compile/links are all rather similar, so there is little
inherent merit in computing a ratio relative to a fixed reference, nor in com-
puting a geometric mean of a number of such ratios.
Reporting SPECint.make is optional, but if SPECint.make is reported, then
the corresponding SPECint is required. So you can't report the best
SPECint.make and the best SPECint as if they could be achieved simultaneously
with the same compiler options unless that's actually the case. But it would
be permissible to report two {SPECint, SPECint.make} pairs, one pair for best
SPECint, another pair for best SPECint.make, always documenting the differ-
ences in how these results were obtained so that independent parties can
reproduce them unaided.
Summary of Reporting Format
Thus I propose that SPEC 3.0 results be reported in the form
SPECint.92 = UI +/- TI        SPECfp.92 = UF +/- TF
U = SPECmean, T = tolerance; at least 3/4 of SPECratios are contained in the
interval [U-T, U+T]. The complete list of portable SPECratios follows.
Optionally after that,
SPECint.92.make = MI seconds        SPECfp.92.make = MF seconds
where M is the total elapsed time to make SPECstat.92 results from scratch,
less the benchmark execution times that go into the SPECstat.92 calculation.
Optionally after that,
SPECint.92.tuned = UI +/- TI        SPECfp.92.tuned = UF +/- TF
followed by the complete list of tuned SPECratios AND the complete list of
source differences between tuned and portable source. Add another column for
SPECmixed.92 if SPEC 3.0 introduces such a subcategory, and another column for
overall SPECmark.92 if SPEC 3.0 retains it. If a single SPECmark.92 computed
over all subcategories is to be retained, it should be computed as the
geometric mean of the subcategory SPECstats so that it's immune to changes in
the proportions of programs in the various subcategories.
Verifying Correctness
Astounding performance while computing erroneous results is easy but not very
interesting. Correctness verification is ideally an independent step that is
not timed as part of the benchmark. Somewhat in contradiction, it is highly
desirable that the correctness be in terms meaningful to the application. For
physical simulations, appropriate tests of correctness include checks that
physically conserved quantities such as momentum and energy are conserved com-
putationally.
Consider the linpack 1000x1000 benchmark as an example, because it's easy
to analyze rather than because it's appropriate for SPEC 3.0. The rules
require you to use the data generation and result testing software provided by
Dongarra, but you may code the computation any reasonable way appropriate to
the system. Correctness is determined by computing and printing a single
number, a normalized residual ||b-Ax||, that depends on all the quantities x
computed in the program - thus foiling optimizers aggressively eliminating
dead code.
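As a sketch of what such a check looks like in practice (my own rendering in
C, not the code distributed with the benchmark), the normalized residual can
be computed along these lines:

    /* Compute the normalized residual ||b - A*x|| / (n * ||A|| * ||x|| * eps)
       using infinity norms, after A*x = b has been solved.  A value of
       order 1-100 suggests a numerically reasonable solve; the acceptable
       threshold is application-dependent, as discussed below. */
    #include <float.h>
    #include <math.h>

    double normalized_residual(int n, const double *A,   /* n*n, row-major */
                               const double *x, const double *b)
    {
        double rmax = 0.0, anorm = 0.0, xnorm = 0.0;
        for (int i = 0; i < n; i++) {
            double ri = b[i], rowsum = 0.0;
            for (int j = 0; j < n; j++) {
                ri -= A[i*n + j] * x[j];
                rowsum += fabs(A[i*n + j]);
            }
            if (fabs(ri) > rmax)    rmax = fabs(ri);
            if (rowsum > anorm)     anorm = rowsum;
            if (fabs(x[i]) > xnorm) xnorm = fabs(x[i]);
        }
        return rmax / (n * anorm * xnorm * DBL_EPSILON);
    }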
How does one determine whether a residual is acceptable? Unfortunately
that question can only be answered by the designers or users of the applica-
tion. In this respect the linpack benchmark is artificial because there
really is no a priori reason to draw the line of acceptable residuals at 10,
or 100, or 1000... it depends on the intended use of the results.
If the correctness criterion were established in absolute terms (rather
than relative to the underlying machine precision as linpack's normalized
residual does) then there would be no harm in rewriting programs in higher
precision and avoiding pivoting, if that produced acceptable results and
improved performance.
The difference between absolute correctness criteria and criteria rela-
tive to machine precision reflects differences between requirements placed on
complete applications and requirements of mathematical software libraries.
The complete application typically needs to compute certain quantities to some
known absolute accuracy. Mathematical software libraries, like the Linpack
library from which the well-known benchmark was drawn, will be used by many
applications with differing requirements, so the quality of the libraries
should be the highest reasonably obtainable with a particular arithmetic pre-
cision, and thus is best measured in units of that precision.
Accepting the output of a known or presumed good reference machine, on
faith, is widespread but questionable. I was long dismayed that compilers I
worked on were never able to get more than about 8 correct digits on some of
the livermore loops, even running in 128-bit precision, but then I discovered
that current Cray machines and compilers can't do any better. The "correct"
checksums embedded in the livermore loops program are said to have been pro-
duced on a now obsolete supercomputer, perhaps a Cray-1.
General Content
SPEC 3.0 benchmarks should time important realistic applications, as com-
plete as portability permits, from whose performance users may reasonably pro-
ject performance of their similar applications.
Benchmarks should be independent in this statistical sense: it should
not be possible statistically to predict the performance of one SPEC benchmark
with any accuracy across many SPEC member platforms based upon the known per-
formance of some other disjoint subset of SPEC benchmarks on those platforms.
As long as important realistic applications are chosen that can be reasonably
verified for correctness, and this independence criterion is satisfied, I see
no need to arbitrarily limit the number of SPEC computational benchmarks.
In addition to the current SPEC 3.0 candidates, I recommend to SPEC for
future consideration the Fortran programs collected by the PERFECT Club.
Aside from spice2g6, which has a different input deck, they seem mostly
independent of current SPEC 1.0 programs.
Some of the SPEC 1.0 gcc and espresso subtests run too fast to be timed
accurately. They should be replaced by more substantial ones.
SPEC should have no qualms about accepting programs that depend on de-
facto standard extensions such as "integer*2" or "complex*16" that can't rea-
sonably be expressed otherwise. However in the interest of portability, it's
desirable to avoid syntactic-sugar extensions like "do .. end do" that can be
readily expressed in standard syntax.
Specific Comments - matrix300
The matrix300 benchmark has outlived its usefulness. Like linpack before
it, it has forced adoption of new technology that has in turn made it
obsolete, for it is now susceptible to optimization improvements seldom
observed in realistic applications. Amazing performance improvements have
been reported by applying modern compiler technology previously reserved for
vectorizing supercomputers:
System          Old SPECratio   New SPECratio
IBM 550              100             730
HP 730                36             510
Competitive compilers should indeed exploit such technology, but it does end
users no good to suggest that many realistic applications will subsequently
show 7X-14X performance improvements. Such results are simply an artifact of
this particular artificial benchmark, and demonstrate how misleading it is to
present SPEC 1.0 performance with one number. Inasmuch as the SPEC 1.0 ver-
sion of matrix300 does not actually report any numerical results, the entire
execution could legitimately be eliminated as dead code, although so far
nobody has exhibited such temerity.
While the matrix multiplication portion of some realistic applications
can and should demonstrate significant improvements, the overall application's
improvement will be tempered by the portions that aren't susceptible to such
optimizations. nasa7 includes a matrix multiplication kernel, but the spirit
of SPEC is much better served by incorporating into SPEC 3.0 certain proposed
realistic applications of which matrix multiplication is one important com-
ponent among others.
Specific Comments - nasa7
Nasa7 consists of the kernels of seven different important computational
applications. As such it was much more realistic - because its kernels were
more realistically complicated - than the livermore loops which it has largely
supplanted. Each of the specific types of applications - involving matrix
multiplication, 2D complex FFT, linear equations solved by Cholesky, block
tridiagonal, and complex Gaussian elimination methods, etc. - should be
represented separately by realistic applications rather than somewhat arbi-
trarily lumped into one benchmark: the repetition factors for the seven ker-
nels are 100, 100, 200, 20, 2, 10, 400. The original version of this program
reported seven rates, for which repetition factors don't matter; the rates
excluded the time spent setting up the data and checking the results. But SPEC
measures elapsed times of whole programs, and combining arbitrary multiples of
diverse kernels into one such program produces performance results that are
difficult to interpret at best. It's analogous to defining the SPECmark as
the ratio of the sum of the times to execute all the SPEC programs to a refer-
ence sum of times, without publishing the individual times or ratios.
Specific Comments - doduc
There is one troubling aspect of the doduc program: it lacks any good
test of correctness other than the number of iterations required to complete
the program, which number might not be a very reliable guide. For instance,
if the simulated time is extended from 50 to 100 seconds, the number of itera-
tions appears to vary by 20% among systems which appear to behave similarly in
shorter runs, undermining confidence in the correctness of the shorter runs.
doduc is an interesting and valuable benchmark that should be retained in SPEC
3.0 if a more confidence-inspiring correctness criterion can be devised.
Platform        Iterations to   Iterations to
                  50 seconds     100 seconds
MIPS M/2000         5480           20400
Sun-3               5480           20400
Sun-4               5480           22700
IBM RS/6000         5490           24600
Specific Comments - spice
The greycode input deck doesn't seem to correspond to any very common
realistic computations, and takes a long time to run as well. A number of
other input decks have been proposed; SPEC 3.0 should include several of them
as spice2g6 subtests.
In addition I urge SPEC to consider, when opportunity permits, the spice3
program from UCB. It is unusual - a publicly available, substantial scien-
tific computation program, written in C. It accepts many of the input decks
that spice2g6 accepts.
Specific Comments - gcc
gcc 1.35 represents relatively old compiler technology suitable for
CISC-based systems such as VAX, 80386, and 68020. gcc 2.0 is designed to do
the kinds of aggressive local optimizations required for RISC architectures -
such as most of the hardware platforms sold on the basis of their SPECmarks.
I encourage SPEC to replace gcc 1.35 with 2.0 as soon as the latter is avail-
able for distribution.
In addition I urge SPEC to consider the f2c Fortran-to-C translator from
AT&T. It is another publicly available, substantial program written in C, with
many of the same kinds of analyses that a full Fortran compiler performs.
SPECstat Computations
SPECint and SPECfp are geometric means of SPECratios of elapsed real
times of realistic applications. That's the correct approach.
I would handle the cases of multiple subtests somewhat differently than
SPEC 1.0 does: for gcc and espresso, and perhaps spice2g6 and spice3 in the
future. Currently the run times of subtests are added up to get an overall
execution time. For the same reason that the geometric mean of several tests
is appropriate for the overall SPECmark, the geometric mean of the SPECratios
of the subtests is the appropriate SPECratio for that test. Thus instead of
adding up the times for all four espresso inputs and comparing that sum to
the sum of the four times on the reference system, I'd compute the SPECratio
for each espresso input, compute the geometric mean of those four SPECratios,
and use that
as the SPECratio for the espresso benchmark when computing the overall SPEC-
mark.
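A small C sketch of the proposed subtest treatment; the four reference and
measured times are invented for illustration:

    /* Compute a SPECratio per subtest input (reference time / measured
       time), then take the geometric mean of those ratios as the
       benchmark's SPECratio, rather than comparing summed times. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double ref[]  = { 110.0, 270.0, 95.0, 430.0 };  /* invented */
        double meas[] = {   6.1,  13.8,  4.9,  25.2 };  /* invented */
        int n = 4, i;
        double logsum = 0.0;

        for (i = 0; i < n; i++)
            logsum += log(ref[i] / meas[i]);

        printf("espresso SPECratio = %.3g\n", exp(logsum / n));
        return 0;
    }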
SPECstat.92 Reference Times
The VAX 780 is rapidly disappearing but is anything but rapid in running
SPEC programs. For convenience, the reference system for SPEC 3.0 should be
widely available and as fast as possible. One could choose candidate refer-
ence platforms on the basis of SPECmass, the performance equivalent of
biomass: SPECmass = SPECstat * installed base. On that basis one of the
SPARCstations might be selected, but it doesn't matter too much - any recent
widely available RISC Unix workstation would do. The reference times would be
the best elapsed times achieved on the reference system by the time SPEC 3.0
was announced, using any combination of compiler and operating system produc-
ing correct results.
The results for the reference system would be remarkably balanced - all
SPECratios equal to 1 - which might appear to be to the advantage of the ven-
dor of the reference system, but any such advantage is illusory. With product
lifetimes of a year or so, the reference system - which by definition has a
large installed base and therefore is near its end of life - would be out of
production during most of the lifetime of the SPEC 3.0 suite, and any replace-
ment products from that vendor would likely have SPECratios that would be far
from uniform.
The marketing science abstraction of a "balanced system" seldom
corresponds to reality anyway - a system can only provide balanced performance
relative to a particular mix of applications. If SPEC needs politically to
avoid choosing one particular reference system, it could compromise by choos-
ing several of roughly comparable integer performance, using one for integer
benchmark reference results, one for floating-point reference results, etc.
I prefer to have a single widely available platform for reference times
for this reason: in my own work I determine performance on a variety of
benchmarks and applications including
* SPEC,
* PERFECT,
* third parties' proprietary programs,
* and programs portable among 32-bit UNIX RISC workstations with IEEE
arithmetic, but not to all SPEC member platforms.
I like to summarize these performance results with SPEC-like ratios and
means, which are somewhat arbitrary, unless I can produce the reference times
that SPEC would use if those programs were part of SPEC.
To avoid intentional or accidental confusion between SPEC 1.0 SPECstats
and SPEC 3.0 SPECstats.92, it's desirable to recalibrate SPECstats.92. If the
SPARCstation 2 were chosen as the SPEC 3 reference, for instance, ignoring the
effects of using a different suite of benchmarks, then SPECstat.92 would be
immediately deflated by a factor of 21, relative to the SPEC 1.0 SPECstat for
the same platform, reducing opportunities for confusion.
Why Geometric Mean is Best for SPEC
The progress of some end users is limited by the time it takes a fixed
series of computational tasks to complete. They then think about the results
and decide what to do next. The appropriate metric for them is the total
elapsed time for the applications to complete, so the arithmetic mean of times
is the appropriate summary statistic. If rates, the inverse of times, happen
to be available instead, the appropriate statistic is the harmonic mean of
rates. If application A runs ten times as long as application B, then a 2X
improvement in application A is ten times as important as a 2X improvement in
application B.
Other computational situations are characterized by a continual backlog
of processes awaiting execution. If the backlog were ever extinguished, users
would simply double the grid densities of their simulations and saturation
would again result. In these
cases the appropriate metric is rates - computations per time - and the
appropriate summary statistic is an arithmetic mean of rates, or if times are
available, a harmonic mean of times. A 2X improvement in application A is
just as important as a 2X improvement in application B.
What about the commonest case consisting of workloads of both sorts?
With geometric means of SPECratios, the conclusions are the same whether rates
or times are used, and rate data and time data may readily be combined.
That's why I like to use the geometric mean to combine SPECratios of diverse
types of programs.
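A small worked example in C, with invented times, shows why this matters: a
time-weighted summary and a rate-weighted summary of the same measurements can
disagree sharply, while the geometric mean of the per-program SPECratios is a
single figure that comes out the same whether times or rates are supplied:

    /* Compare three summaries of the same measurements: total-time based,
       total-rate based, and the geometric mean of per-program SPECratios. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double ref_t[]  = { 100.0, 400.0, 50.0 };  /* reference times, invented */
        double test_t[] = {   5.0,  40.0,  1.0 };  /* measured times, invented  */
        int n = 3, i;
        double sum_ref = 0.0, sum_test = 0.0;
        double sum_ref_rate = 0.0, sum_test_rate = 0.0, logsum = 0.0;

        for (i = 0; i < n; i++) {
            sum_ref += ref_t[i];            sum_test += test_t[i];
            sum_ref_rate += 1.0 / ref_t[i]; sum_test_rate += 1.0 / test_t[i];
            logsum += log(ref_t[i] / test_t[i]);
        }
        printf("time-weighted summary: %.3g\n", sum_ref / sum_test);           /* ~12 */
        printf("rate-weighted summary: %.3g\n", sum_test_rate / sum_ref_rate); /* ~38 */
        printf("geometric mean of SPECratios: %.3g\n", exp(logsum / n));       /* ~22 */
        return 0;
    }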
As with benchmarks themselves, the most appropriate way to combine bench-
mark results varies among end users according to their situation. SPEC has
wisely chosen the most neutral way to combine results - while requiring that
individual results be available as well.
For Further Information About SPEC
About SPEC, contact Kaivalya Dixit, dixit@eng.sun.com.
Acknowledgements
Helpful criticism by Kaivalya Dixit, Charles Grassl, and others improved
this report.