Code structure

Original code structure

The inner loops of the FORTRAN-90 code have the following form.

!$OMP DO reduction(+:achtemp)
do igp = 1, ngpown

  do iw=1,3 ! Original Inner Loop Bad for Vectorization


    do ig = 1, ncouls

      delw = array2(ig,np) / (wxt(iw,np) - array1(ig,igp))
      scht = scht + ... * delw * array3(ig,igp) 

    enddo ! loop over g

    achtemp(iw) = achtemp(iw) + 0.5D0*scht*vcoul(igp)



where, in the production code, we block the ig loop around the iw loop in order to gain data reuse. However, the inner 'ig' loop is left appropriately long to get efficient vector performance - typically block sizes of around 256 are used, which is many vector lengths on a KNL processor.

Portability considerations

There are essentially three hot-arrays in this code, for convenience named array1, array2, array3 corresponding to the three complex-double precision arrays on the equation in the previous page: \(M\), \(\Omega\) and \(\tilde{\omega}\). It will be important to place these in the fastest memory tier.

The data-structures are generally double-precision complex. This is a native FORTRAN type, but is less standard in C/C++. Performance additionally requires a fast vectorizable-divide instruction for complex-numbers or a suitable work-around. This was an issue only the earlier generation Xeon-Phi, Knights Corner, for example.