interesting discussion of differences between C and Fortran from sci.math.num-analysis

David G. Hough on validgh dgh
Sat Aug 24 12:22:41 PDT 1991


From: homerapalm.cray.com (Bill Homer)
Subject: Re: Fortran vs. C on Cray Y-MP

In a recent comp.sys.super posting, you observed that C should not
automatically be dismissed in favor of Fortran for numerical programs on a
CRI system, and I'd like to expand on that theme.  My main points are that:

1.  The C language really does present more of a challenge than Fortran
    to an optimizing compiler.
2.  Nevertheless, a language extension combined with ever more aggressive
    compiler optimization can, in many cases, give comparable performance.
3.  Such extensions are being coordinated through the Numerical C
    Extensions Group with the goal that, in the not-too-distant future, it
    will be reasonable to use them in a program intended to be portable.
4.  No matter what the language, it helps to meet a compiler half-way, by
    choosing, among equivalent ways of expressing operations, those which
    most clearly expose opportunities for vector or parallel execution.

---------------------------------------------------------------------------

1.  Consider the example originally from Jim Giles in an earlier thread: 

      subroutine copyit(a,b,n)              void copyit(a,b,n)
      integer n                             int n;
      real a(n), b(n)                       float *a, *b;
      integer i, p, q                       {
                                               float *z; int p, q;
      p = irtc()                               p = _rtc();
      do i=1,n
         a(i) = b(i)                           for(z=a; z<a+n; *z++ = *b++);
      enddo
   1  a(1) = a(n)                              a[0] = a[n-1];
      q = irtc()                               q = _rtc();
      print *, q-p                             printf("%d\n",q-p);
      end                                   }

Performance when compiled by the Cray Fortran and Standard C compilers
with default options is shown in the first two lines below.  The third
line shows the effect of a familiar pragma, and the last line shows the
effect of two new features (see point 2 below), introduced in the 3.0
release of the Cray Standard C compiler.  With the use of those features,
both available through command line options, performance is equivalent to
that of the Fortran version.  The timings are typical values from several
batch executions (so that small differences are not significant).

				     Execution time (clocks)
				    -------------------------
				    n= 10  100	 1000	10000
				    -----  ---	 ----	-----
	Fortran			      124  212	 1243	11446
	C with no changes	      384  495	 1915	15972
	C with pragma ivdep	      133  258	 1623	15722
	C with restricted pointers    144  248	 1295	11406
	   and aggressive analysis

Although the two versions of copyit may look equivalent, they are not.
The Fortran version may be used only to copy values between two disjoint
arrays, or two disjoint portions of the same array.  The C version may
also be used to duplicate a block of one or more values within a single
array.  Thus a call of the form copyit(&c[1], &c[0], n) copies the value
of c[0] to c[1]...c[n].  More generally, a call of the form
copyit(&c[k], &c[0], n) copies the values of c[0]...c[k-1] to
c[k]...c[2*k-1], then copies these new values of c[k]...c[2*k-1]
to c[2*k]...c[3*k-1], etc.

To vectorize the loop in copyit, the C compiler must therefore include
code both to compute the maximum safe vector length k (from a-b), and to
ensure that each block of k stores completes before the next block of k
loads is initiated.  (Note that stores and loads can progress
asynchronously on separate ports to memory.)  As a result, the C version
performs worse than the Fortran version for copying between disjoint
objects, where this "extra" code is not needed.

Unfortunately, there is no way, within Standard C, to write copyit so that
it may be used only to copy between disjoint objects:  You must pay for
the full generality whether you need it or not.  In the timings shown
above, note that the cost (relative to Fortran) of computing the safe
vector length is more significant for short loops, and the cost of
ensuring that stores complete is more significant for long loops.

---------------------------------------------------------------------------

2.  The Cray Standard C compiler has always offered a pragma, "ivdep"
(ignore vector dependencies), for promoting vectorization.  For historical
reasons, however, this directive asserts only that k is not less than the
length of a vector register.  This allows the compiler to eliminate the
safe vector length calculation, but not the store completion code.

The most recent release, 3.0, also offers a language extension that
enables you to write copyit so that it has the same restriction as in
Fortran:  it may be used only to copy between disjoint objects.  If you
are willing to modify the source code, you can change the declarations of
the parameters to read:

		float * restrict a, * restrict b;

Here 'restrict' is a type qualifier, and the compiler may assume that a
pointer of restrict-qualified type points into a unique object (as if it
obtained its value from a call to malloc).  If you are unwilling to modify
the source code (and if it is appropriate to restrict the pointer
parameters of all other functions defined in the file), then you can use a
command line option, "cc -h restrict=f ...".

With restricted pointers, the C version of copyit is equivalent to the
Fortran version, but it is still more difficult for the compiler to
analyze because it uses the pointer, 'z', rather than using an index into
'a'.  Also new to release 3.0 is a more aggressive level of analysis that
tracks pointer values.  When combined with restricted pointers, it gives
the same level of performance as Fortran.  (The cost in added compile time
is an increase of about 50%, from 0.097 to 0.150 seconds.)

---------------------------------------------------------------------------

3.  An extension such as restricted pointers is most useful if it is
defined with as much care as the rest of the language, and if it is
supported on many systems.  CRI is therefore actively participating in the
Numerical C Extensions Group (ANSI X3J11.1), which has the goal of
producing a Technical Report defining a set of of language extensions for
numerical programming.  CRI has submitted to the NCEG proposals for:

  Restricted pointers.
    (analogous to Fortran dummy arguments, as described above)

  Variable length arrays.
    (analogous to Fortran adjustable and automatic arrays)

  Complex data types, and associated library functions.
    (with float or double real and imaginary parts)

All three are implemented in the 3.0 release of the Cray Standard C
compiler.  For programmers who are willing to make use of them, these
extensions should make C substantially more attractive for numerical
programming.  They 'restore' performance and features to C that are
'missing', compared to Fortran.

---------------------------------------------------------------------------

4.  For the other, more complicated example, the C and Fortran performance
were comparable (within 5%, with C somewhat faster).  But this shouldn't be
too surprising, since the code as written offers neither the C nor the
Fortran compiler a straightforward opportunity to vectorize.  Recall the
Fortran version:

C
C PAGE 141-142: NUMERICAL MATHEMATICS AND COMPUTING, CHENEY/KINCAID, 1985
C
C DERIVATIVE BY CENTER DIFFERENCES AND RICHARDSON EXTRAPOLATION (DERIV,F)
C
      SUBROUTINE DERIV(F,X,N,H,D,ID)
      DIMENSION D(ID,N)
      DO 3 I=1,N
        D(I,1)=(F(X+H)-F(X-H))/(2.0*H)
        Q=4.0
        DO 2 J=1,I-1
          D(I,J+1)=D(I,J)+(D(I,J)-D(I-1,J))/(Q-1.0)
          Q=4.0*Q
   2    CONTINUE
        H=H/2.0
   3  CONTINUE
      RETURN
      END

Although the current compiler cannot do it automatically, this code can be
reorganized by hand to give a vectorizable inner loop.  It requires
reordering and splitting the outer loop, and then inverting a loop nest
with a triangular pattern of reference.  This reduces execution time from
11925 to 8834 clocks, a 26% speedup (using the parameters in your posting:
F = sin, X = pi/3, N = 10, H = 1.0).

      SUBROUTINE DERIV(F,X,N,H,D,ID)
      DIMENSION D(ID,N)
      DO 1 I=1,N
        D(I,1)=(F(X+H)-F(X-H))/(2.0*H)
        H=H/2.0
   1  CONTINUE
      Q=4.0
      DO 3 J=1,N-1
        DO 2 I=J+1,N
          D(I,J+1)=D(I,J)+(D(I,J)-D(I-1,J))/(Q-1.0)
   2    CONTINUE
        Q=4.0*Q
   3  CONTINUE
      RETURN
      END

Note that availability of variable length arrays in C makes it much more
convenient to translate subroutines like these, which have dummy arguments
consisting of multidimensional arrays.  A translation of the reorganized
Fortran version (omitting a superfluous parameter, ID):

void deriv(double f(double), double x, int n, double h, double d[][n])
{
    int i, j;
    double q;

    for (i = 0; i < n; ++i) {
        d[i][0] = (f(x+h) - f(x-h)) / (2.0*h);
        h /= 2.0;
    }
    q = 4.0;
    for (j = 0; j < n-1 ; ++j) {
        # pragma _CRI ivdep
        for (i = j+1; i < n; ++i) {
            d[i][j + 1] = d[i][j] + (d[i][j] - d[i-1][j]) / (q - 1.0);
        }
        q *= 4.0;
    }
    return;
}

Note that although the parameter d is declared as a two dimensional variable
length array, the semantics of C causes the declaration to be interpreted as
a pointer:

	double (* d)[n]

Since pointers to variable length arrays are beyond the scope of the
aggressive level of optimization for the 3.0 release of the C compiler, it
is necessary to supply a pragma ivdep to get the inner loop to vectorize.
The resulting performance is comparable to Fortran at 8624 clocks.



More information about the Numeric-interest mailing list