unit stride vs non-unit stride on supercomputers

David Hough sun!Eng!David.Hough
Thu Jan 28 17:30:35 PST 1993


I am guilty of disseminating some major misinformation.   I'm looking for the
correct information to disseminate instead.

Those of you involved with supercomputers, minisupers, superminis, and 
high-end RISC workstations know that floating-point performance is often
a function of memory bandwidth, and memory bandwidth is rather sensitive to
how memory is touched.

RISC workstations with internal or external caches usually prefer memory
access with unit stride, e.g. 

	do 1 i=1,n
	x(i) = ...

or

	do 1 i=1,n
	x(i,j) = ...

in Fortran.   The following is almost always inferior in performance:

	do 1 i=1,n
	x(j,i) = ...

because Fortran stores arrays in column-major order: when the second
subscript varies in the inner loop, successive references are a full
column (the leading dimension of x) apart in memory, so the data
accessed on successive iterations is unlikely to be in the cache.
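When the x(j,i) reference sits inside a doubly nested loop, the usual
cure on a cached machine is to interchange the loops so that the
innermost loop varies the leftmost subscript.  A minimal sketch (the
bounds and the zero assignment are made up for illustration; both
nests do the same work):

	do 1 j=1,m
	do 1 i=1,n
	x(j,i) = 0.0
    1	continue

becomes

	do 2 i=1,n
	do 2 j=1,m
	x(j,i) = 0.0
    2	continue

so that successive iterations of the inner loop touch adjacent memory
locations and keep reusing the same cache lines.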

In contrast, supercomputers like Crays don't have caches, but have
interleaved banks of memory instead.   There stride apparently doesn't
matter much, as long as successive references don't land in the same
bank before it has had time to recover from the previous reference.
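For instance, with memory interleaved word by word across the banks,
the bank a reference goes to is roughly its address modulo the number
of banks, so a stride that is a multiple of the bank count sends every
reference to the same bank, and each one waits for that bank to
recover.  A made-up sketch, assuming 16 banks (the bank count and the
array dimensions are mine for illustration, not any particular
machine's):

	real x(64,64), y(65,64)
	j = 1
c	x(j,1), x(j,2), ... are 64 words apart; 64 is a multiple of 16,
c	so every reference falls in the same bank
	do 1 i=1,64
	x(j,i) = 0.0
    1	continue
c	padding the leading dimension to 65 makes successive references
c	65 words apart, so they cycle through all 16 banks
	do 2 i=1,64
	y(j,i) = 0.0
    2	continue
	end

Padding the leading dimension to a value with no factor in common with
the bank count is the usual way to keep such strided references from
piling up on one bank.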

This is where I got confused.   I somehow got the idea that unit stride
was bad for Crays, and said so, only to be corrected by people who knew 
something about it.

However, it may be the case that there are some high-performance
systems for which unit stride is not optimal.   Many programs ported
from supercomputers that show performance anomalies on RISC
workstations seem to use non-unit stride intentionally, presumably for
performance, since it doesn't seem particularly natural in context.
And some versions of the Kuck preprocessor have an option for
specifying whether unit or non-unit stride is preferable.

So the question is: for which systems, if any are still in use, does unit
stride usually perform worse than non-unit stride? 



