SPARCstation 10 external cache performance study available for review
David G. Hough on validgh
dgh
Wed Jan 6 16:24:18 PST 1993
I am looking for comments prior to wider distribution. On request I will
send the "tbl | troff -ms" source for the report whose first page is
reproduced below. If you are interested, let me know. Please do not
reproduce or further distribute it until I post the final version to
USENET newsgroups.
Analysis of a number of CPU-bound programs displaying performance
anomalies on SPARCstation 10/41, relative to SPARCstation 10/30, indicates
that most anomalies are due to working sets larger than the 1 MB MXCC external
cache on 10/41. Some others are due to infelicities of SC2.0.1 compiled
code.
Below are some suggestions for programmers that may provide better
performance from existing 10/41's with existing SC2.0.1 compilers. Some
suggestions that help the 10/41+SC2.0.1 combination may be counter-productive
with other hardware or other software releases. Every suggestion should be
tested by comparing performance before and after, on each specific
application, to ensure that it actually confers a benefit:
* Divide large computations into 1 MB chunks.
* Avoid combining simple unit-stride array operations.
* Avoid operations on structs or unions in inner loops.
* Use tmpfs for large sequential files.
* Transpose matrices to maximize the number of unit-stride inner loops.
* Vary the leftmost subscript in Fortran inner loops.
* Explicitly unroll rolled inner loops in source.
* Explicitly roll unrolled inner loops in source.
* Avoid traversing huge data structures.
* Use Winograd-Strassen matrix multiplication methods for sufficiently
large matrices.
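
As an illustration of the first suggestion only, here is a minimal C sketch
of dividing a two-pass computation into cache-sized chunks. The function name,
the choice of operations, and the chunk parameter are hypothetical and are not
taken from the report; the right chunk size for a given program has to be
found by timing, as noted above. The two passes are kept as separate simple
loops and are merely run one chunk at a time, so the second pass can find its
data still cached.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Scale x[0..n-1] by alpha, then return the sum of squares of the result.
 * Both passes run over one chunk before moving to the next, so pass 2
 * revisits data that pass 1 just touched.
 * "chunk" is the number of elements per chunk (hypothetical tuning knob);
 * on a 1 MB cache one might start near 65536 doubles (512 KB) and measure.
 */
double scale_and_sum_squares(double *x, size_t n, double alpha, size_t chunk)
{
    double sum = 0.0;

    assert(chunk > 0);
    assert(x != NULL || n == 0);

    for (size_t start = 0; start < n; start += chunk) {
        size_t end = (n - start < chunk) ? n : start + chunk;

        for (size_t i = start; i < end; i++)   /* pass 1 on this chunk */
            x[i] *= alpha;

        for (size_t i = start; i < end; i++)   /* pass 2 on the same chunk */
            sum += x[i] * x[i];
    }
    return sum;
}
```

Whether this helps on a particular program depends on how much other data
competes for the cache, so it should be checked by measurement like any other
suggestion in the list.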
SPARCstation 10's contain Viking (SuperSPARC) CPU chips with an internal
virtual-address 16KB 4-way associative data cache and a 20KB 5-way associative
instruction cache, with cache line sizes of 32 bytes. The 10/30 operates at 36
MHz and the 10/41 at 40.3 MHz. Thus the expected ratio of 10/41 execution
times to 10/30 execution times is 36/40.3 = 89%, based on clock rate alone.
However, the 10/41 also contains an external physical-address 1MB MXCC
(SuperCache) direct-mapped combined I&D cache, with a cache line size of 128
bytes. This cache imposes a higher miss penalty - a minimum of 24 cycles, a
maximum of about 80 cycles - than the miss penalty for the internal cache.
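
To make "direct-mapped" concrete, the following C sketch computes which
cache line slot a physical address would occupy, using the 1 MB size and
128-byte line size given above. It assumes plain modulo indexing on the
physical address, which is the textbook definition of a direct-mapped cache;
the actual MXCC indexing details are not described here, and the function
names are hypothetical.

```c
#include <assert.h>

#define CACHE_BYTES (1024ul * 1024ul)            /* 1 MB, from the text */
#define LINE_BYTES  128ul                        /* line size, from the text */
#define NUM_LINES   (CACHE_BYTES / LINE_BYTES)   /* 8192 line slots */

/* Slot a physical address maps to, assuming simple modulo indexing. */
unsigned long cache_line_index(unsigned long phys_addr)
{
    return (phys_addr / LINE_BYTES) % NUM_LINES;
}

/*
 * Nonzero if a and b lie in different memory lines that map to the same
 * slot, so that loading one evicts the other.
 */
int addresses_conflict(unsigned long a, unsigned long b)
{
    assert(LINE_BYTES > 0 && NUM_LINES > 0);
    return cache_line_index(a) == cache_line_index(b)
        && (a / LINE_BYTES) != (b / LINE_BYTES);
}
```

Under this assumption, any two physical addresses exactly 1 MB apart compete
for the same slot, which is one reason two arrays that together fit in the
cache can still evict each other.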
Thus programs that have small instruction and working data sets will
reside in the internal cache in both systems and should display a relative
performance of 89% due solely to clock rate.
Programs with somewhat larger requirements, but fitting in the external
cache, may display significantly better relative performance - lower than 89%
- by eliminating many miss penalty cycles because data is supplied from the
external cache rather than main memory.
But programs with very large working sets may display significantly worse
relative performance - greater than 89% - as data is missed in both the
internal and external caches and incurs the greater external miss penalties.
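
The three cases above can be illustrated with a deliberately crude model in
C: execution time is compute cycles plus miss-stall cycles, divided by clock
rate. The clock rates are those given above; everything else (the function
names, the assumption that both machines see the same number of internal
cache misses, and whatever per-miss penalties are passed in) is hypothetical
and serves only to show the direction of the effect, not to reproduce any
measurement.

```c
#include <assert.h>

/*
 * Run time in microseconds: compute cycles plus stall cycles for misses,
 * divided by the clock rate in MHz (cycles per microsecond).
 */
double run_time_us(double compute_cycles, double misses,
                   double penalty_cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return (compute_cycles + misses * penalty_cycles) / clock_mhz;
}

/*
 * 10/41 time divided by 10/30 time for the same program.
 * penalty_41 and penalty_30 are assumed average stall cycles per internal
 * cache miss on each machine (hypothetical inputs).
 */
double relative_performance(double compute_cycles, double misses,
                            double penalty_41, double penalty_30)
{
    return run_time_us(compute_cycles, misses, penalty_41, 40.3)
         / run_time_us(compute_cycles, misses, penalty_30, 36.0);
}
```

With zero misses the ratio reduces to 36/40.3. A per-miss penalty on the
10/41 that is lower than on the 10/30 (data supplied by the external cache)
pushes the ratio below that value, and a higher one (misses in both caches)
pushes it above.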