SPARCstation 10 external cache performance study available for review

Wed Jan 6 16:24:18 PST 1993

I am soliciting comments prior to wider distribution.  On request, I will
send the "tbl | troff -ms" source for the report whose first page is
reproduced below.   If you are interested, let me know.   Please do not
reproduce or further distribute it until I post the final version to
USENET newsgroups.

     Analysis of a number of CPU-bound programs displaying performance
anomalies on SPARCstation 10/41, relative to SPARCstation 10/30, indicates
that most anomalies are due to working sets larger than the 1 MB MXCC external
cache on 10/41.   Some others are due to infelicities of SC2.0.1 compiled
code.

     Below are some suggestions for programmers that may yield better
performance from existing 10/41's with existing SC2.0.1 compilers.  Some
suggestions that help the 10/41+SC2.0.1 combination may be counter-productive
with other hardware or other software releases.  Each suggestion should be
tested by comparing performance before and after, on each specific
application, to ensure that it actually confers a benefit:

*    Divide large computations into 1 MB chunks.

*    Avoid combining simple unit-stride array operations.

*    Avoid operations on structs or unions in inner loops.

*    Use tmpfs for large sequential files.

*    Transpose matrices to maximize the number of unit-stride inner loops.

*    Vary the leftmost subscript in Fortran inner loops.

*    Explicitly unroll rolled inner loops in source.

*    Explicitly roll unrolled inner loops in source.

*    Avoid traversing huge data structures.

*    Use Winograd-Strassen matrix multiplication methods for sufficiently
     large matrices.
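
     As an illustration of the first suggestion, here is a minimal C sketch
of a two-pass computation done one chunk at a time, so that the second pass
over each chunk has a chance of finding the data still in the external cache.
The function name, the choice of computation, and the chunk_elems parameter
are assumptions made for this example and are not taken from the report.  The
two unit-stride passes are deliberately kept as separate loops.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical example: scale an array, then sum its squares, working
 * one chunk at a time.  chunk_elems is an assumed tuning parameter; for
 * doubles, a 1 MB chunk would be (1024 * 1024) / sizeof(double) elements.
 */
double scale_then_sum_squares(double *a, size_t n, double factor,
                              size_t chunk_elems)
{
    double total = 0.0;
    size_t start, end, i;

    assert(chunk_elems > 0);

    for (start = 0; start < n; start = end) {
        /* Clamp the final chunk to the end of the array. */
        end = (n - start < chunk_elems) ? n : start + chunk_elems;

        for (i = start; i < end; i++)   /* pass 1: unit-stride scale */
            a[i] *= factor;

        for (i = start; i < end; i++)   /* pass 2: unit-stride reduce */
            total += a[i] * a[i];
    }
    return total;
}
```

     Whether a full 1 MB chunk or something smaller works best is exactly
the kind of question that should be settled by timing the application before
and after the change.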

     SPARCstation 10's contain Viking (SuperSPARC) CPU chips with an internal
virtual-address 16KB 4-way associative data cache and a 20KB 5-way associative
instruction cache, with cache line sizes of 32 bytes.  10/30 operates at 36
MHz and 10/41 at 40.3 MHz.   Thus the expected performance ratio of 10/41
execution times divided by 10/30 execution times is 36/40.3 = 89%, based on
clock rate alone.
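
     The baseline arithmetic can be written down as a small C sketch.  The
function names are hypothetical; only the clock rates (36 and 40.3 MHz) come
from the text above.

```c
#include <assert.h>

/*
 * Expected ratio of execution times (faster machine divided by slower
 * machine) if clock rate were the only difference between them.
 */
double expected_time_ratio(double slow_mhz, double fast_mhz)
{
    assert(slow_mhz > 0.0 && fast_mhz > 0.0);
    return slow_mhz / fast_mhz;
}

/*
 * Measured ratio for one program: elapsed time on the 10/41 divided by
 * elapsed time on the 10/30.  Values below the expected ratio mean the
 * 10/41 did better than clock rate alone predicts; values above it
 * mean it did worse.
 */
double measured_time_ratio(double t_1041_sec, double t_1030_sec)
{
    assert(t_1030_sec > 0.0);
    return t_1041_sec / t_1030_sec;
}
```

     Calling expected_time_ratio(36.0, 40.3) gives roughly 0.893, the 89%
figure used as the reference point throughout.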

     However, 10/41 also contains an external physical-address 1MB MXCC
(SuperCache) direct-mapped combined I&D cache, with a cache line size of 128
bytes.  This cache imposes a higher miss penalty - a minimum of 24 cycles, a
maximum of about 80 cycles - than the miss penalty for the internal cache.
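
     To make the direct-mapped geometry concrete, the following C sketch
computes which external cache line a physical address falls in.  It assumes
the textbook direct-mapped scheme (line index = line address modulo number of
lines); the function and macro names are invented for this example and are
not MXCC documentation.  With 1 MB and 128-byte lines there are 8192 lines,
so two physical addresses that differ by a multiple of 1 MB compete for the
same line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MXCC_SIZE_BYTES (1024u * 1024u)                     /* 1 MB */
#define MXCC_LINE_BYTES 128u                                /* line size */
#define MXCC_NUM_LINES  (MXCC_SIZE_BYTES / MXCC_LINE_BYTES) /* 8192 */

/* Index of the cache line that a physical address maps to. */
uint32_t mxcc_line_index(uint64_t paddr)
{
    assert(MXCC_NUM_LINES == 8192u);    /* sanity check on the geometry */
    return (uint32_t)((paddr / MXCC_LINE_BYTES) % MXCC_NUM_LINES);
}

/*
 * True if two physical addresses lie in different 128-byte lines of
 * memory that map to the same cache line, so that touching one evicts
 * the other.
 */
bool mxcc_conflict(uint64_t pa, uint64_t pb)
{
    bool same_index = mxcc_line_index(pa) == mxcc_line_index(pb);
    bool same_line  = (pa / MXCC_LINE_BYTES) == (pb / MXCC_LINE_BYTES);
    return same_index && !same_line;
}
```

     Since the cache is indexed by physical address, a program generally
cannot tell from virtual addresses alone which of its pages will collide;
the sketch only shows why collisions are possible even when the working set
is smaller than 1 MB.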

     Thus programs that have small instruction and working data sets will
reside in the internal cache in both systems and should display a relative
performance of 89% due solely to clock rate.

     Programs with somewhat larger requirements, but fitting in the external
cache, may display significantly better relative performance - lower than 89%
- by eliminating many miss penalty cycles because data is supplied from the
external cache rather than main memory.

     But programs with very large working sets may display significantly worse
relative performance - greater than 89% - as accesses miss in both the
internal and external caches and incur the greater external miss penalties.
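
     The three cases can be summarized in a rough C sketch that classifies a
program by the size of its working data set.  This is a deliberate
simplification made for illustration: it compares against the 16KB internal
data cache and the 1MB external cache only, and ignores associativity,
instruction footprint, and conflict misses.  All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define INTERNAL_DCACHE_BYTES (16u * 1024u)     /* 16KB data cache */
#define EXTERNAL_CACHE_BYTES  (1024u * 1024u)   /* 1MB MXCC cache */

enum regime { FITS_INTERNAL, FITS_EXTERNAL, EXCEEDS_EXTERNAL };

/* Classify a working data set size into one of the three cases. */
enum regime classify_working_set(size_t bytes)
{
    if (bytes <= INTERNAL_DCACHE_BYTES)
        return FITS_INTERNAL;
    if (bytes <= EXTERNAL_CACHE_BYTES)
        return FITS_EXTERNAL;
    return EXCEEDS_EXTERNAL;
}

/* Expected 10/41 : 10/30 execution time ratio for each case. */
const char *expected_ratio_note(enum regime r)
{
    switch (r) {
    case FITS_INTERNAL:    return "about 89%";
    case FITS_EXTERNAL:    return "possibly lower than 89%";
    case EXCEEDS_EXTERNAL: return "possibly greater than 89%";
    }
    assert(0 && "unknown regime");
    return "";
}
```

     A classification like this is only a first guess at where a program
falls; measured timings on both machines remain the real test.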