SPEC benchmarks for hardware buying decisions

Sun Nov 13 06:37:38 PST 1994

In a message posted to the numeric-interest list last August,
David Hough <dgh@validgh.com> asked some interesting philosophical
questions about how end users use SPEC benchmark results in deciding
what PC/workstation-class hardware to buy:

| I've got a question for end users who take SPEC ratios into consideration
| when purchasing PC's or workstations.   First some background:
| 
| The test programs in SPECfp92 have been criticized for being too simple,
| and unrealistic representations of real applications.     One way to address
| that would be to add more realistic applications to SPECfp92, but most
| realistic applications involve i/o performance (because of large input
| or intermediate or output data) and graphics performance.    Since 
| system performance on big problems may involve all of these factors,  
| adding large problems to SPEC might be viewed as a good thing. 
| 
| Another point of view is that SPECint92 and SPECfp92 are not intended to
| be system-level performance tests, but rather measurements of "CPU"
| capability.   Well not exactly: what they measure is a function of
| CPU chip(s), cache, compilers, and libraries, so it's not really pure
| CPU, but i/o, graphics, and system overhead in general are not important.
| Large test programs that might be CPU-intensive on large systems
| might become I/O-intensive on small systems.     So adding large programs
| to SPECfp92 might confuse more than illuminate.    In Unix terms,
| SPEC programs should be mostly user time on the types of "small" 
| PC's and workstations that are used for entry-level technical computing.
| As of 1993 I would estimate that comes out to be about 16MB RAM and 256KB
| cache, subject to ongoing increase.
| 
| Another approach would be to split the floating-point benchmarks into
| two groups.   One would be restricted to programs that are CPU-intensive on
| "small" systems, however defined, and the other would be unrestricted and
| thus might encompass significant I/O and graphics, which might be bottlenecks
| on small systems and perhaps not on larger systems.    Programs whose 
| working sets are larger than 32MB, such as perfect.mg3b, are an example.
| SPECfp92 would be defined over the "small" benchmarks, but no mean would
| be published for the large problems,  since their performance varies so much
| more according to the configuration of the test system.
| 
| So the question is:
| 
| What would be the most useful approach for you as an 
| end user evaluating computer hardware for purchase?   What changes, if any,
| would make SPEC results more usefully relevant to you?    Do any of the
| foregoing ideas appeal to you - if not, what should be done instead?

I would like to keep the basic principle that SPEC results reflect
mainly user-mode 'CPU' (actually CPU + cache + memory + compiler +
elementary-function-library) times, given enough memory that paging
isn't a serious factor.  Of course, I'd also like to see the SPEC
suite include applications that are as 'realistic' as possible, and
these days that means that some of them should have large working
sets (e.g. perfect.mg3b).  Presumably vendors will put enough memory
in their systems for paging not to be a serious factor, and the
standard SPEC reporting rules ensure that the amount of memory used
is disclosed.  Because small-working-set problems are very common,
the SPEC suite should probably only include a few (say 20%)
'huge-memory' programs.

The one serious problem I can see with this scheme is that it will
give some truly awful ratings to machines whose maximum supported
memory size still leaves huge-memory programs thrashing.  However,
except for the VAX 11/780, people don't usually retest old hardware
with new SPEC suites.

I'm not sure what to do about the VAX 11/780 problem.  Some people
have suggested renormalizing the next-generation SPEC, but I'd like
to keep the old and new numbers roughly comparable.  One possibility
would be to switch to a new reference machine or machines, but keep
approximately the old normalization, i.e. something along the lines
of

	(individual-benchmark SPEC ratio for machine X)
		  (SPEC92 rating of machine R) * (time on machine R)
		= --------------------------------------------------
				 (time on machine X)

where 'machine R' is a more modern reference machine, one which can
handle the 'huge-memory' SPEC programs without serious paging.
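
As a rough sketch of that arithmetic (taking a SPEC ratio to be
reference time divided by measured time, so that a faster machine
gets a larger number), something like the following would do.  The
function name and the sample numbers are made up for illustration;
nothing here comes from SPEC itself.

```python
def renormalized_ratio(rating_r, time_r, time_x):
    """SPEC-style ratio for machine X, scaled via reference machine R.

    rating_r: SPEC92 rating already published for machine R
    time_r:   run time of this benchmark on machine R (seconds)
    time_x:   run time of the same benchmark on machine X (seconds)
    """
    if time_r <= 0 or time_x <= 0:
        raise ValueError("run times must be positive")
    # A machine that finishes in half the time gets twice the ratio.
    return rating_r * time_r / time_x


# Hypothetical numbers, for illustration only.
print(renormalized_ratio(rating_r=50.0, time_r=120.0, time_x=60.0))  # 100.0
```

Machine R itself always comes out at its old SPEC92 rating, which is
what keeps the old and new numbers roughly comparable.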

Another possibility would be to follow David's suggestion and split
SPEC.fp into SPEC.fp.{large,small}.  In general I like this better,
as it avoids having {large,small}-memory programs distorting the
SPEC ratios for {small,large} ones.  That is, presumably for many
systems SPEC.fp.large will differ significantly from (and generally
be smaller than) SPEC.fp.small, so the same reasoning (which I agree
with) that led SPEC89 to be split into SPEC{int,fp}92 would suggest
a further split into next-generation-SPEC{int,fp-small,fp-large}.
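
To make the split concrete, here is a minimal sketch that computes a
separate geometric mean (the averaging SPEC{int,fp}92 already use)
for each group.  The 32MB cutoff is borrowed from David's example
above and is only an assumption, as are the function names, the
benchmark names and all the numbers.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive SPEC ratios."""
    if not ratios:
        raise ValueError("need at least one ratio")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def split_ratings(results, small_limit_mb=32):
    """Return (small_mean, large_mean) for a dict of
    name -> (spec_ratio, working_set_mb).

    Either mean is None if its group is empty.
    """
    small = [r for r, ws in results.values() if ws <= small_limit_mb]
    large = [r for r, ws in results.values() if ws > small_limit_mb]
    return (geometric_mean(small) if small else None,
            geometric_mean(large) if large else None)


# Hypothetical benchmark names and numbers.
results = {
    "small_a": (40.0, 4),
    "small_b": (90.0, 8),
    "large_a": (20.0, 64),
}
small_mean, large_mean = split_ratings(results)
print(f"fp.small = {small_mean:.1f}, fp.large = {large_mean:.1f}")
```

Keeping the two groups apart means one thrashing huge-memory program
cannot drag down the small-memory number, and vice versa.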

My one concern about splitting the floating-point benchmark suite
is that if there are too many SPEC numbers the trade press (both
review articles and vendor ads) won't publish them all.  For
example, a quick look through the past few months of 'Unix Review'
(TM) reveals several vendor ads and product review articles
comparing SPEC{fp,int}92, but alas *no* individual-benchmark SPEC
ratios.  I realise that the SPEC Newsletter contains all the
details, but in practice I, and I think almost all other buyers,
have never seen a copy of it, and don't have the time or the
inclination to try to track one down (at $500+/year, my local
University library probably doesn't have a copy).

In this latter vein, David has previously suggested (in a message
to numeric-interest back in June 1991) that the SPEC reporting
rules require reporting SPEC results as
	rating +/- tolerance
I think this is a very good idea.  This gives some indication of
the variance in performance, yet is still succinct enough to get
widely published.
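
How the tolerance would be defined is left open here.  Purely as one
possible reading (my assumption for illustration, not necessarily
what David proposed), the rating could be the usual geometric mean
and the tolerance the sample standard deviation of the
individual-benchmark ratios.  The function names and numbers below
are hypothetical.

```python
import statistics


def rating_with_tolerance(ratios):
    """Return (rating, tolerance) for a list of positive SPEC ratios.

    rating:    geometric mean of the ratios
    tolerance: sample standard deviation of the ratios (an assumed
               definition; 0.0 if only one ratio is given)
    """
    rating = statistics.geometric_mean(ratios)
    tolerance = statistics.stdev(ratios) if len(ratios) > 1 else 0.0
    return rating, tolerance


def format_rating(ratios):
    """Format the result in the 'rating +/- tolerance' style."""
    rating, tolerance = rating_with_tolerance(ratios)
    return f"{rating:.1f} +/- {tolerance:.1f}"


# Hypothetical individual-benchmark ratios for one machine.
print(format_rating([35.0, 42.0, 58.0, 120.0]))
```

A machine with one outlying benchmark then shows a visibly larger
tolerance than one whose ratios are all close together, even if the
two ratings are the same.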

- Jonathan Thornburg
  University of British Columbia / Physics Dept / <thornbur@physics.ubc.ca>
  "Washing one's hands of the conflict between the powerful and the powerless
   means to side with the powerful, not to be neutral." - Freire / OXFAM