Search for multiply-add speedup
David Hough
sun!Eng!David.Hough
Mon Jul 1 15:02:04 PDT 1991
In an attempt to quantify the circumstances in which parallel-dispatch
multiply-add is a big win, I conducted some experiments with an IBM RS/6000
using the current Fortran product with a VAST preprocessor. I found the
results puzzling to some degree. I encourage posting comparable
results to numeric-interest, from anybody with access to
an HP snake, or an i860 system, or anything
else with a simultaneous-dispatch multiply-add.
In trying to quantify the speedup definitely attributable to the multiply-add
operator in the IBM RS/6000 ISA, I coded up the program listed at the end, which
computes a "matrix multiplication" of four perverse flavors: not only the
normal sum of products, but also a sum of sums, a product of sums,
and a product of products. The first can be executed with multiply-add
primitives while the others can't.
All of these run without generating any IEEE exceptions
and produce identical numerical results on SPARCstation-2's and on
an RS/6000 5x0 server.
Even with the VAST preprocessor active (xlf -P),
the relative speedup of the regular
matrix product seems to be 30% at best over the other cases, and only for
small matrices.
This is not bad, but I expected more benefit on larger matrices.
The following data is for 400x400, 200x200, and 100x100 matrices.
The 400x400 data set is about 3.8 MB; the 100x100 about 240 KB.
For comparison:
SPARCstation 2 results, Fortran 1.4, -O4 -cg89 -dalign -Bstatic, 400x400:
add-add time 31.6200 check 22487.500000000
mul-mul time 35.1500 check -3.9439395423980D-50
add-mul time 30.7100 check -1.5796671500020D+20
mul-add time 31.5200 check 7721.2347564697
mul-mul/add-add 1.11164
mul-mul/add-mul 1.14458
mul-mul/mul-add 1.11516
More or less as expected, mul-mul (product of products) was somewhat slower
than the other possibilities, by 11-14%, probably
due to an extra cycle requirement.
IBM results, xlf -O -Q, 400x400:
add-add time 53.11999893 check 22487.5000000000000
mul-mul time 49.68000412 check -0.394393954239797447E-49
add-mul time 49.88999176 check -0.157966715000203182E+21
mul-add time 49.53999329 check 7721.23475646972656
mul-mul/add-add 0.9352410436
mul-mul/add-mul 0.9957909584
mul-mul/mul-add 1.002826214
Based on my understanding of the implementation, I would have thought
that multiply-add would be much faster than the others, which would be more
or less comparable. Cache effects must be drowning everything else.
IBM results, xlf -O -Q -P, 400x400:
add-add time 9.470000267 check 22487.5000000000000
mul-mul time 9.290000916 check -0.394393954239797447E-49
add-mul time 9.279998779 check -0.157966715000203182E+21
mul-add time 8.259998322 check 7721.23475646972656
mul-mul/add-add 0.9809926748
mul-mul/add-mul 1.001077771
mul-mul/mul-add 1.124697685
The VAST preprocessor has a significant impact here, vastly reducing
cache misses, but a 12% relative advantage for multiply-add
is somewhat smaller than I'd expected.
xlf -O -Q for 200x200:
add-add time 1.460000038 check -24589.0625000000000
mul-mul time 0.9999998808 check -0.626182012680451482E-25
add-mul time 0.9900000095 check 321512910028764096.
mul-add time 0.7700002193 check 3227.36961555480957
mul-mul/add-add 0.6849313974
mul-mul/add-mul 1.010100842
mul-mul/mul-add 1.298700809
30% is more like it.
xlf -O -Q -P for 200x200:
add-add time 1.189999938 check -24589.0625000000000
mul-mul time 1.149999976 check -0.626182012680451482E-25
add-mul time 1.150000095 check 321512910028764096.
mul-add time 1.019999743 check 3227.36961555480957
mul-mul/add-add 0.9663865566
mul-mul/add-mul 0.9999998808
mul-mul/mul-add 1.127451301
The preprocessor doesn't help as much as I'd expected here - it generally makes
things worse, and dilutes the benefit of multiply-add.
xlf -O -Q for 100x100:
add-add time 0.1700000018 check -7717.38281250000000
mul-mul time 0.1200000346 check -0.829295853032013205E-05
add-mul time 0.1200000048 check 139338833709.862213
mul-add time 0.8999997377E-01 check 365.805094003677368
mul-mul/add-add 0.7058825493
mul-mul/add-mul 1.000000238
mul-mul/mul-add 1.333334088
xlf -O -Q -P for 100x100:
add-add time 0.1599999815 check -7717.38281250000000
mul-mul time 0.1500000060 check -0.829295853032013205E-05
add-mul time 0.1499999762 check 139338833709.862213
mul-add time 0.1299999952 check 365.805094003677368
mul-mul/add-add 0.9375001192
mul-mul/add-mul 1.000000238
mul-mul/mul-add 1.153846264
To the extent that you can tell anything about these very short time
intervals, it looks like the preprocessor doesn't do much good if the data
fits in the cache.
Looks like the benefit of multiply-add shows up primarily when the data fits
in the cache, even with the benefit of an aggressive preprocessor.
Correct conclusion?
Test program:
#define DP
#ifndef DIM
#define DIM 400
#endif
#ifdef SP
#define REAL real
#define TOREAL real
#define ZERO 0.0
#define ONE 1.0
#define PREC ' SINGLE '
#define NTIMES 1000/SPLIN
#endif
#ifdef DP
#define REAL doubleprecision
#define TOREAL dble
#define ZERO 0.0d0
#define ONE 1.0d0
#define PREC ' DOUBLE '
#define NTIMES 1000/DPLIN
#endif
#ifdef QP
#define REAL real*16
#define TOREAL qreal
#define ZERO 0.0q0
#define ONE 1.0q0
#define PREC ' QUADRUPLE '
#define NTIMES 1
#endif
      REAL a(DIM,DIM),b(DIM,DIM),c(DIM,DIM),check
      call matgen(a,DIM,DIM)
      call matgen(b,DIM,DIM)
      t1=second()
      call matAA(DIM,a,DIM,b,DIM,c,DIM)
      t2=second()
      tAA = t2-t1
      print *,' add-add time ',tAA,' check ',check(DIM,c,DIM)
      call matgen(a,DIM,DIM)
      call matgen(b,DIM,DIM)
      t1=second()
      call matMM(DIM,a,DIM,b,DIM,c,DIM)
      t2=second()
      tMM = t2-t1
      print *,' mul-mul time ',tMM,' check ',check(DIM,c,DIM)
      call matgen(a,DIM,DIM)
      call matgen(b,DIM,DIM)
      t1=second()
      call matAM(DIM,a,DIM,b,DIM,c,DIM)
      t2=second()
      tAM = t2-t1
      print *,' add-mul time ',tAM,' check ',check(DIM,c,DIM)
      call matgen(a,DIM,DIM)
      call matgen(b,DIM,DIM)
      t1=second()
      call matMA(DIM,a,DIM,b,DIM,c,DIM)
      t2=second()
      tMA = t2-t1
      print *,' mul-add time ',tMA,' check ',check(DIM,c,DIM)
      print *,' mul-mul/add-add ', tMM/tAA
      print *,' mul-mul/add-mul ', tMM/tAM
      print *,' mul-mul/mul-add ', tMM/tMA
      end
      subroutine matgen(a,lda,n)
      REAL a(lda,1)
c
      init = 1325
      do 30 j = 1,n
         do 20 i = 1,n
            init = mod(3125*init,65536)
            a(i,j) = (init - 32768.0)/16384.0
   20    continue
   30 continue
      return
      end
      real function second()
      real etime, tarray(2)
      second = etime(tarray)
      return
      end
      REAL function check(n,c,nc)
      REAL c(nc,n),t
      t=0
      do 1 i = 1, n
         do 2 j = 1, n
            t=t + c(i,j)
    2    continue
    1 continue
      check=t
      end
      subroutine matMA(n,a,na,b,nb,c,nc)
      REAL a(na,n), b(nb,n), c(nc,n),t
      do 1 i = 1,n
         do 2 j = 1,n
            t=0
            do 3 k = 1,n
               t = t + a(i,k)*b(k,j)
    3       continue
            c(i,j)=t
    2    continue
    1 continue
      end
      subroutine matAA(n,a,na,b,nb,c,nc)
      REAL a(na,n), b(nb,n), c(nc,n),t
      do 1 i = 1,n
         do 2 j = 1,n
            t=0
            do 3 k = 1,n
               t = t + a(i,k)+b(k,j)
    3       continue
            c(i,j)=t
    2    continue
    1 continue
      end
      subroutine matAM(n,a,na,b,nb,c,nc)
      REAL a(na,n), b(nb,n), c(nc,n),t
      do 1 i = 1,n
         do 2 j = 1,n
            t=1
            do 3 k = 1,n
               t = t * (a(i,k)+b(k,j))
    3       continue
            c(i,j) = t
    2    continue
    1 continue
      end
      subroutine matMM(n,a,na,b,nb,c,nc)
      REAL a(na,n), b(nb,n), c(nc,n),t
      do 1 i = 1,n
         do 2 j = 1,n
            t = 1
            do 3 k = 1,n
               t = t * (a(i,k)*b(k,j))
    3       continue
            c(i,j)=t
    2    continue
    1 continue
      end