Search for multiply-add speedup

David Hough sun!Eng!David.Hough
Mon Jul 1 15:02:04 PDT 1991


In an attempt to quantify the circumstances in which parallel-dispatch
multiply-add is a big win, I conducted some experiments with an IBM RS/6000
using the current Fortran product with a VAST preprocessor.  I found the
results somewhat puzzling.  I encourage anybody with access to an HP Snake,
an i860 system, or anything else with a simultaneous-dispatch multiply-add
to post comparable results to numeric-interest.

In trying to quantify the speedup definitely attributable to the multiply-add
operator in the IBM RS/6000 ISA, I coded up the program listed at the end,
which computes a "matrix multiplication" of four perverse flavors: not only
the normal sum of products, but also a product of sums, a product of
products, and a sum of sums.  The first can be executed with multiply-add
primitives while the others can't.
All of these run without generating any IEEE exceptions
and produce identical numerical results on SPARCstation 2's and on
an RS/6000 5x0 server.
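
For reference, the inner-loop recurrences of the four kernels, taken from
the program below, are:

c mul-add: normal sum of products; on the RS/6000 the multiply and
c add fuse into a single multiply-add instruction with one rounding
	t = t + a(i,k)*b(k,j)
c add-add: sum of sums
	t = t + a(i,k)+b(k,j)
c add-mul: product of sums
	t = t * (a(i,k)+b(k,j))
c mul-mul: product of products
	t = t * (a(i,k)*b(k,j))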

Even with the VAST preprocessor active (xlf -P),
the relative speedup of the regular
matrix product seems to be 30% at best over the other cases, and only for
small matrices.
This is not bad, but I expected more benefit on larger matrices.
The following data are for 400x400, 200x200, and 100x100 matrices.
The 400x400 data set is about 3.8 MB; the 100x100, about 240 KB.

For comparison:
SPARCstation 2 results, Fortran 1.4, -O4 -cg89 -dalign -Bstatic, 400x400:

 add-add time     31.6200 check     22487.500000000
 mul-mul time     35.1500 check    -3.9439395423980D-50
 add-mul time     30.7100 check    -1.5796671500020D+20
 mul-add time     31.5200 check     7721.2347564697
 mul-mul/add-add     1.11164
 mul-mul/add-mul     1.14458
 mul-mul/mul-add     1.11516

More or less as expected, mul-mul (product of products) was somewhat slower
than the other possibilities, by 11-14%, probably
due to an extra cycle requirement.
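
For scale, each 400x400 kernel performs 2 * 400**3, about 1.3e8,
floating-point operations, so the 31.52-second mul-add time works out to
roughly 4 MFLOPS.  (My arithmetic, derived from the times above, not
separately measured.)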

IBM results, xlf -O -Q, 400x400:

  add-add time   53.11999893      check   22487.5000000000000     
  mul-mul time   49.68000412      check -0.394393954239797447E-49 
  add-mul time   49.88999176      check -0.157966715000203182E+21 
  mul-add time   49.53999329      check   7721.23475646972656     
  mul-mul/add-add  0.9352410436     
  mul-mul/add-mul  0.9957909584     
  mul-mul/mul-add   1.002826214   

Based on my understanding of the implementation, I would have thought
that multiply-add would be much faster than the others, which would be more
or less comparable.  Cache effects must be drowning everything else. 
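
One plausible accounting: in the inner loop over k, the a(i,k) operand
strides through memory by 400*8 = 3200 bytes per iteration, so with a
3.8 MB working set nearly every such load can miss, and the miss time
swamps any small per-operation difference among the arithmetic flavors.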

IBM results, xlf -O -Q -P, 400x400:

  add-add time   9.470000267      check   22487.5000000000000     
  mul-mul time   9.290000916      check -0.394393954239797447E-49 
  add-mul time   9.279998779      check -0.157966715000203182E+21 
  mul-add time   8.259998322      check   7721.23475646972656     
  mul-mul/add-add  0.9809926748     
  mul-mul/add-mul   1.001077771     
  mul-mul/mul-add   1.124697685   

The VAST preprocessor has a significant impact here, presumably by vastly
reducing cache misses, but a 12% relative advantage for multiply-add
is somewhat smaller than I'd expected.
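
I haven't inspected the transformed source, but the classic transformation
with this effect is a loop interchange.  A minimal sketch of what such a
preprocessor might produce for the mul-add kernel, using the same REAL and
ZERO macros as the listing below (my reconstruction, not VAST's actual
output; the subroutine name is made up):

	subroutine matMAi(n,a,na,b,nb,c,nc)
c j,k,i loop order: c(i,j) and a(i,k) are accessed with unit
c stride, and b(k,j) is invariant in the inner loop
	REAL a(na,n), b(nb,n), c(nc,n)
	do 1 j = 1,n
	do 2 i = 1,n
	c(i,j) = ZERO
 2	continue
	do 3 k = 1,n
	do 4 i = 1,n
	c(i,j) = c(i,j) + a(i,k)*b(k,j)
 4	continue
 3	continue
 1	continue
	end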

xlf -O -Q for 200x200:

  add-add time   1.460000038      check  -24589.0625000000000     
  mul-mul time  0.9999998808      check -0.626182012680451482E-25 
  add-mul time  0.9900000095      check   321512910028764096.     
  mul-add time  0.7700002193      check   3227.36961555480957     
  mul-mul/add-add  0.6849313974     
  mul-mul/add-mul   1.010100842     
  mul-mul/mul-add   1.298700809  

30% is more like it, although add-add is now anomalously slow as well.

xlf -O -Q -P for 200x200:

  add-add time   1.189999938      check  -24589.0625000000000     
  mul-mul time   1.149999976      check -0.626182012680451482E-25 
  add-mul time   1.150000095      check   321512910028764096.     
  mul-add time   1.019999743      check   3227.36961555480957     
  mul-mul/add-add  0.9663865566     
  mul-mul/add-mul  0.9999998808     
  mul-mul/mul-add   1.127451301    

The preprocessor doesn't help as much here as I'd expected - it generally
makes things worse, and dilutes the benefit of multiply-add.

xlf -O -Q for 100x100:

  add-add time  0.1700000018      check  -7717.38281250000000     
  mul-mul time  0.1200000346      check -0.829295853032013205E-05 
  add-mul time  0.1200000048      check   139338833709.862213     
  mul-add time  0.8999997377E-01  check   365.805094003677368     
  mul-mul/add-add  0.7058825493     
  mul-mul/add-mul   1.000000238     
  mul-mul/mul-add   1.333334088     

xlf -O -Q -P for 100x100:

  add-add time  0.1599999815      check  -7717.38281250000000     
  mul-mul time  0.1500000060      check -0.829295853032013205E-05 
  add-mul time  0.1499999762      check   139338833709.862213     
  mul-add time  0.1299999952      check   365.805094003677368     
  mul-mul/add-add  0.9375001192     
  mul-mul/add-mul   1.000000238     
  mul-mul/mul-add   1.153846264   

To the extent that you can tell anything from these very short time
intervals, it looks like the preprocessor doesn't do much good when the data
fits in the cache.
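
(A caveat, based on typical Unix timer implementations rather than anything
measured here: etime usually ticks at 1/60 or 1/100 second, so a 0.09-second
interval is only good to 10-20%.)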

It looks like the benefit of multiply-add shows up primarily when the data
fits in the cache, even with the benefit of an aggressive preprocessor.
Correct conclusion?
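
For reference, one DIM x DIM double-precision matrix occupies DIM*DIM*8
bytes: 80 KB at 100x100, 320 KB at 200x200, and 1.28 MB at 400x400; the
three matrices together give the 240 KB and 3.8 MB figures quoted above.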

Test program:


#define DP

#ifndef DIM
#define DIM 400
#endif

#ifdef SP
#define REAL real
#define TOREAL real
#define ZERO 0.0
#define ONE 1.0
#define PREC ' SINGLE '
#define NTIMES 1000/SPLIN
#endif

#ifdef DP
#define REAL doubleprecision
#define TOREAL dble
#define ZERO 0.0d0
#define ONE 1.0d0
#define PREC ' DOUBLE '
#define NTIMES 1000/DPLIN
#endif

#ifdef QP
#define REAL real*16
#define TOREAL qreal
#define ZERO 0.0q0
#define ONE 1.0q0
#define PREC ' QUADRUPLE '
#define NTIMES 1
#endif
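
c Driver: for each of the four flavors, regenerate the operands,
c time one DIM by DIM "product", and print a checksum of the result.
c DIM is guarded by #ifndef and so can be overridden at compile time
c (e.g. -DDIM=200 on compilers that pass .F files through cpp);
c the precision is chosen by editing the #define at the top.
c (The NTIMES macro is defined but never used in this version.)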
      	
	REAL a(DIM,DIM),b(DIM,DIM),c(DIM,DIM),check
	call matgen(a,DIM,DIM)
	call matgen(b,DIM,DIM)
	t1=second()
	call matAA(DIM,a,DIM,b,DIM,c,DIM)
	t2=second()
	tAA = t2-t1
	print *,' add-add time ',tAA,' check ',check(DIM,c,DIM)
	call matgen(a,DIM,DIM)
	call matgen(b,DIM,DIM)
	t1=second()
	call matMM(DIM,a,DIM,b,DIM,c,DIM)
	t2=second()
	tMM = t2-t1
	print *,' mul-mul time ',tMM,' check ',check(DIM,c,DIM)
	call matgen(a,DIM,DIM)
	call matgen(b,DIM,DIM)
	t1=second()
	call matAM(DIM,a,DIM,b,DIM,c,DIM)
	t2=second()
	tAM = t2-t1
	print *,' add-mul time ',tAM,' check ',check(DIM,c,DIM)
	call matgen(a,DIM,DIM)
	call matgen(b,DIM,DIM)
	t1=second()
	call matMA(DIM,a,DIM,b,DIM,c,DIM)
	t2=second()
	tMA = t2-t1
	print *,' mul-add time ',tMA,' check ',check(DIM,c,DIM)
	print *,' mul-mul/add-add ', tMM/tAA
	print *,' mul-mul/add-mul ', tMM/tAM
	print *,' mul-mul/mul-add ', tMM/tMA
	end

c Fill the n by n matrix a with pseudo-random values in [-2,2),
c using a small multiplicative congruential generator.  The seed is
c fixed, so every generated matrix is identical and runs reproduce.
      subroutine matgen(a,lda,n)
      REAL a(lda,*)
c
      init = 1325
      do 30 j = 1,n
         do 20 i = 1,n
            init = mod(3125*init,65536)
            a(i,j) = (init - 32768.0)/16384.0
   20    continue
   30 continue
      return
      end
       
c CPU timer: the Unix library function etime returns user+system
c CPU seconds; tarray receives the two components separately.
	real function second()
	real etime, tarray(2)
	second = etime(tarray)
	return
	end

c Checksum: the sum of all entries of c, used to verify that the
c kernels produce identical results across machines.
	REAL function check(n,c,nc)
	REAL c(nc,n),t
	t=0
	do 1 i = 1, n
	do 2 j = 1, n
	t=t + c(i,j)
 2	continue
 1	continue
	check=t
	end
 
c mul-add: the normal matrix product, c(i,j) = sum over k of
c a(i,k)*b(k,j); the only kernel fusible into multiply-adds.
	subroutine matMA(n,a,na,b,nb,c,nc)
	REAL a(na,n), b(nb,n), c(nc,n),t
	do 1 i = 1,n
	do 2 j = 1,n
	t=0
	do 3 k = 1,n
	t = t + a(i,k)*b(k,j)
 3	continue
	c(i,j)=t
 2	continue
 1	continue
	end	
 
c add-add: sum of sums, c(i,j) = sum over k of a(i,k)+b(k,j).
	subroutine matAA(n,a,na,b,nb,c,nc)
	REAL a(na,n), b(nb,n), c(nc,n),t
	do 1 i = 1,n
	do 2 j = 1,n
	t=0
	do 3 k = 1,n
	t = t + a(i,k)+b(k,j)
 3	continue
 	c(i,j)=t
 2	continue
 1	continue
	end	
 
c add-mul: product of sums, c(i,j) = product over k of a(i,k)+b(k,j).
	subroutine matAM(n,a,na,b,nb,c,nc)
	REAL a(na,n), b(nb,n), c(nc,n),t
	do 1 i = 1,n
	do 2 j = 1,n
	t=1
	do 3 k = 1,n
	t = t * (a(i,k)+b(k,j))
 3	continue
	c(i,j) = t
 2	continue
 1	continue
	end	
 
c mul-mul: product of products, c(i,j) = product over k of a(i,k)*b(k,j).
	subroutine matMM(n,a,na,b,nb,c,nc)
	REAL a(na,n), b(nb,n), c(nc,n),t
	do 1 i = 1,n
	do 2 j = 1,n
	t = 1
	do 3 k = 1,n
	t = t * (a(i,k)*b(k,j))
 3	continue
	c(i,j)=t
 2	continue
 1	continue
	end	



