Notes from BLAS-3 lecture by Van Loan

David G. Hough dgh
Mon Jul 16 15:15:10 PDT 1990


Prof. Charles Van Loan of Cornell presented some current issues in
mathematical software for high performance computers at a talk today at
IBM's Almaden Research Center.  If you are already familiar with LAPACK
and BLAS-3, you can skip the following list of points that caught my
attention.

LAPACK is based on level-3 BLAS that are matrix-matrix operations as
opposed to matrix-vector or vector-vector operations.  There are only
about seven level-3 BLAS functions, mostly involving matrix
multiplication of various flavors, and one involving solving triangular
systems of linear equations.
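
To make this concrete, here is a stripped-down C sketch of the core
level-3 operation, C := alpha*A*B + beta*C, for square column-major
matrices.  A real GEMM also takes transpose options, rectangular
shapes, and leading dimensions; the name and layout here are just for
illustration.

#include <stddef.h>

/* C := alpha*A*B + beta*C for n x n column-major matrices.
   Real GEMM handles transposes, rectangular shapes, and leading
   dimensions as well; this is illustration only. */
void gemm_sketch(size_t n, double alpha, const double *A,
                 const double *B, double beta, double *C)
{
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < n; i++)
            C[i + j*n] *= beta;
        for (size_t k = 0; k < n; k++) {
            double t = alpha * B[k + j*n];
            for (size_t i = 0; i < n; i++)
                C[i + j*n] += t * A[i + k*n];
        }
    }
}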

The payoff for level-3 BLAS is that for n x n data, they are about n
times more efficient than lower-level BLAS, because each data item is
subject to n times as many floating-point operations, so that the ratio
of compute ops to load/store ops is much more favorable.  That memory
operations are much more significant than floating-point hardware in
the high performance end of this business is indicated by the
difficulty of writing a matrix transposition function that is efficient
on a variety of high-performance architectures; floating-point
arithmetic doesn't figure in at all.
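
A rough operation count makes the point.  The little program below
gives my own back-of-the-envelope figures, not Van Loan's: it compares
flops per word of data touched for representative level-1, level-2,
and level-3 operations, and only the level-3 ratio grows with n.

#include <stdio.h>

/* Assumed counts: axpy (level 1) does 2n flops on about 3n operands;
   gemv (level 2) does 2n^2 flops on about n^2 + 3n operands;
   gemm (level 3) does 2n^3 flops on about 4n^2 operands
   (read A, B, C and write C). */
int main(void)
{
    for (int n = 32; n <= 2048; n *= 4) {
        double d = (double)n;
        double level1 = 2.0*d / (3.0*d);
        double level2 = 2.0*d*d / (d*d + 3.0*d);
        double level3 = 2.0*d*d*d / (4.0*d*d);
        printf("n = %4d   level-1 %.2f   level-2 %.2f   level-3 %.1f\n",
               n, level1, level2, level3);
    }
    return 0;
}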

Anyway, the goal is to transform standard linear algebra operations into
a form in which most of the flops are in matrix multiplication routines
which can be highly optimized, at least in principle; then the problem
becomes how to express those matrix multiplications in a way that leads
to efficient compilation on most high-performance architectures.  For
instance, vector length and stride are not necessarily optimized by the
same algorithms.  There were several other tradeoffs; Van Loan is
looking for insights into how to parameterize these in a way that
mathematical software developers can exploit.
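
One concrete form the question takes is a blocked multiply whose block
size is the tuning knob.  The sketch below is my own illustration, not
anything from LAPACK, and the block size of 64 is a placeholder that
would be chosen per machine to fit cache size, vector length, stride
penalties, and so on.

#include <stddef.h>

#define NB 64   /* placeholder block size; tune per machine */

/* Blocked C := C + A*B for n x n column-major matrices. */
void blocked_matmul(size_t n, const double *A, const double *B, double *C)
{
    for (size_t jj = 0; jj < n; jj += NB)
        for (size_t kk = 0; kk < n; kk += NB)
            for (size_t ii = 0; ii < n; ii += NB) {
                size_t jmax = jj + NB < n ? jj + NB : n;
                size_t kmax = kk + NB < n ? kk + NB : n;
                size_t imax = ii + NB < n ? ii + NB : n;
                for (size_t j = jj; j < jmax; j++)
                    for (size_t k = kk; k < kmax; k++) {
                        double t = B[k + j*n];
                        for (size_t i = ii; i < imax; i++)
                            C[i + j*n] += t * A[i + k*n];
                    }
            }
}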

Taking adaptive quadrature as a kind of paradigm, we would like to be
able to design algorithms that adaptively choose between level-2 and
level-3 algorithms at run time to maximize efficiency.
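
A minimal sketch of what such run-time adaptivity might look like,
with the two kernels and the crossover point left as machine-dependent
inputs.  The names below are hypothetical, not LAPACK interfaces.

#include <stddef.h>

typedef void (*kernel_fn)(size_t n, double *A);

/* Choose a level-2 or level-3 kernel at run time based on problem size. */
void adaptive_update(size_t n, double *A,
                     kernel_fn level2_kernel,  /* e.g. a gemv-based update */
                     kernel_fn level3_kernel,  /* e.g. a gemm-based update */
                     size_t crossover)         /* tuned per machine */
{
    if (n < crossover)
        level2_kernel(n, A);  /* too small to amortize blocking overhead */
    else
        level3_kernel(n, A);  /* bulk of flops land in matrix multiply */
}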

For LU decomposition and related algorithms, the fraction of flops that
can be put into matrix multiplication tends to 1 - (1/N) when a matrix
is partitioned into N blocks, and the matrix multiplication algorithms
don't do any more flops than level-1 algorithms.  For typical QR
eigenvalue algorithms, however, the ratio seems to be stuck at about
1/2.
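
To show where the flops go, here is a toy right-looking blocked LU in
C, with no pivoting and the dimension assumed to be a multiple of the
block size; real LAPACK code (DGETRF) adds partial pivoting and handles
ragged edges.  Nearly all of the work lands in the trailing-matrix
update, which is a plain matrix multiply.

#include <stddef.h>

#define B 32   /* block size; a real code would tune this per machine */

/* Unblocked LU (no pivoting) of an m x nb panel (m >= nb), column-major
   with leading dimension lda, overwriting the panel with L and U. */
static void panel_lu(size_t m, size_t nb, size_t lda, double *A)
{
    for (size_t j = 0; j < nb; j++)
        for (size_t i = j + 1; i < m; i++) {
            A[i + j*lda] /= A[j + j*lda];
            for (size_t c = j + 1; c < nb; c++)
                A[i + c*lda] -= A[i + j*lda] * A[j + c*lda];
        }
}

/* Right-looking blocked LU of an n x n matrix, n a multiple of B. */
void blocked_lu(size_t n, double *A)
{
    for (size_t k = 0; k < n; k += B) {
        panel_lu(n - k, B, n, A + k + k*n);     /* small share of the flops */

        /* U12 := L11^{-1} * A12: triangular solve, itself a level-3 op */
        for (size_t j = k + B; j < n; j++)
            for (size_t p = k; p < k + B; p++)
                for (size_t i = p + 1; i < k + B; i++)
                    A[i + j*n] -= A[i + p*n] * A[p + j*n];

        /* A22 := A22 - L21*U12: matrix multiply; with N = n/B blocks this
           update carries roughly the 1 - (1/N) fraction of the flops */
        for (size_t j = k + B; j < n; j++)
            for (size_t p = k; p < k + B; p++) {
                double t = A[p + j*n];
                for (size_t i = k + B; i < n; i++)
                    A[i + j*n] -= A[i + p*n] * t;
            }
    }
}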

For 100x100 double complex matrices on IBM RIOS, the level-3 based code
is 2-3X faster than level-2.

Some of you will remember the Strassen theory for multiplying matrices
in less than n**3 time, based on divide and conquer and doing a 2x2
block multiplication with 7 instead of 8 multiplications.  This has
been generally regarded as
only theoretically interesting since the asymptotic superiority only
shows up for very large dimensions and the numerical stability of the
process seems questionable.  According to Van Loan, David Bailey coded
up Strassen-style algorithms for Crays and found them to be faster than
the usual ones for problems of interesting size and of overall
comparable numerical stability.
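
For reference, one level of Strassen's scheme looks like this; applied
recursively with the scalar entries replaced by matrix blocks it gives
the n**2.81 method.  The tiny program just verifies the
seven-multiplication identities on a 2x2 example.

#include <stdio.h>

/* Strassen's seven products and the four recombinations for a 2x2
   multiply; the same formulas apply when the entries are blocks. */
void strassen_2x2(double a[2][2], double b[2][2], double c[2][2])
{
    double m1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
    double m2 = (a[1][0] + a[1][1]) *  b[0][0];
    double m3 =  a[0][0]            * (b[0][1] - b[1][1]);
    double m4 =  a[1][1]            * (b[1][0] - b[0][0]);
    double m5 = (a[0][0] + a[0][1]) *  b[1][1];
    double m6 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
    double m7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);

    c[0][0] = m1 + m4 - m5 + m7;
    c[0][1] = m3 + m5;
    c[1][0] = m2 + m4;
    c[1][1] = m1 - m2 + m3 + m6;
}

int main(void)
{
    double a[2][2] = {{1, 2}, {3, 4}}, b[2][2] = {{5, 6}, {7, 8}}, c[2][2];
    strassen_2x2(a, b, c);             /* expect 19 22 / 43 50 */
    printf("%g %g\n%g %g\n", c[0][0], c[0][1], c[1][0], c[1][1]);
    return 0;
}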

Van Loan mentioned that a German investigator had written a matrix
multiplication routine, thought to be optimal for some system, running
to on the order of 100,000 lines of assembler code.  Given product
lifetimes of two years
or so these days I have to question the value of that kind of effort.
It seems more fruitful to put that kind of effort into better compilers
and better languages that allow you to portably specify what the
compilers need to know to generate efficient code.


