OpenMP

OpenMP is a specification for a set of compiler directives, library routines, and environment variables that can be used to specify high-level parallelism in Fortran and C/C++ programs. The OpenMP API uses the fork-join model of parallel execution. Multiple threads of execution perform tasks defined implicitly or explicitly by OpenMP directives. (Text taken from OpenMP FAQ and API specification.)
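
As a minimal sketch of the fork-join model in C (the function name fill_squares is hypothetical, chosen for illustration): execution is serial until the directive, a team of threads is forked to share the loop iterations, and the threads join again at the end of the loop. If the compiler's OpenMP flag is omitted, the pragma is ignored and the loop simply runs serially with the same result.

```c
/* Fill out[i] with i*i for i in [0, n).
   The parallel-for directive forks a team of threads that divide the
   iterations among themselves; all threads join at the end of the loop.
   Each iteration writes a distinct element, so no synchronization is needed. */
void fill_squares(int n, long *out)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        out[i] = (long)i * i;
    }
}
```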

Although the directives in early versions of the OpenMP specification focused on thread-level parallelism, more recent versions (especially 4.0 and 4.5) have generalized the specification to address additional types of parallelism, and combinations of them, reflecting the increasing degree of on-node parallelism in HPC architectures. In particular, OpenMP 4.0 introduced the simd and target constructs. We discuss each of these in detail below.

omp simd

Decorating a loop with the simd construct informs the compiler that the loop iterations are independent and can be executed with SIMD instructions (e.g., AVX-512 on Intel Xeon Phi). For example:

!$omp simd
do i = 1, array_size
  a(i) = b(i) * c(i)
end do
!$omp end simd

Example output from a compiler optimization report for this loop is as follows:

LOOP BEGIN at main.f90(9,3)
   remark #15388: vectorization support: reference A(i) has aligned access   [ main.f90(10,5) ]
   remark #15388: vectorization support: reference B(i) has aligned access   [ main.f90(10,12) ]
   remark #15388: vectorization support: reference C(i) has aligned access   [ main.f90(10,19) ]
   remark #15305: vectorization support: vector length 16
   remark #15399: vectorization support: unroll factor set to 4
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 2
   remark #15449: unmasked aligned unit stride stores: 1
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 6
   remark #15477: vector cost: 0.310
   remark #15478: estimated potential speedup: 19.200
   remark #15488: --- end vector cost summary ---
LOOP END
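
For readers working in C, the same loop might be written as in the sketch below. The function name multiply_simd and the restrict qualifiers are assumptions added for illustration; restrict is the programmer's promise that the arrays do not overlap. The pragma takes effect only when OpenMP support is enabled at compile time (for example -fopenmp or -fopenmp-simd with GCC and Clang); otherwise it is ignored and the loop is compiled as ordinary C.

```c
/* Element-wise product a[i] = b[i] * c[i].
   'restrict' (an assumption of this sketch) states that a, b and c
   do not alias, which matches the independence that simd asserts. */
void multiply_simd(int n, float *restrict a,
                   const float *restrict b, const float *restrict c)
{
    /* Tell the compiler the iterations are independent and may be
       executed with SIMD instructions. */
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] * c[i];
    }
}
```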

The simd construct can be combined with the traditional parallel for (or parallel do in Fortran) construct to execute the loop with both multi-threading and SIMD instructions, e.g.,

!$omp parallel do simd
do i = 1, array_size
  a(i) = b(i) * c(i)
end do
!$omp end parallel do simd

The optimization report for the above snippet is as follows:

Begin optimization report for: MAIN

    Report from: OpenMP optimizations [openmp]

main.f90(8:9-8:9):OMP:MAIN__:  OpenMP DEFINED LOOP WAS PARALLELIZED

    Report from: Vector optimizations [vec]

LOOP BEGIN at main.f90(8,9)
   remark #15388: vectorization support: reference a(i) has aligned access   [ main.f90(10,5) ]
   remark #15389: vectorization support: reference b(i) has unaligned access   [ main.f90(10,12) ]
   remark #15389: vectorization support: reference c(i) has unaligned access   [ main.f90(10,19) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15305: vectorization support: vector length 32
   remark #15399: vectorization support: unroll factor set to 2
   remark #15309: vectorization support: normalized vectorization overhead 0.667
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
   remark #15449: unmasked aligned unit stride stores: 1
   remark #15450: unmasked unaligned unit stride loads: 2
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 6
   remark #15477: vector cost: 0.370
   remark #15478: estimated potential speedup: 15.670
   remark #15488: --- end vector cost summary ---
LOOP END
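
The combined construct also accepts the clauses of its constituent constructs. The C sketch below is a variation on the loop above, not code from the original text: it computes a dot product, and the reduction clause gives every thread (and every SIMD lane) a private partial sum that is combined when the loop ends. The function name dot_parallel_simd is hypothetical.

```c
/* Dot product of b and c over n elements.
   'parallel for' splits the iterations across threads, 'simd' vectorizes
   each thread's chunk, and reduction(+:sum) combines the partial sums. */
double dot_parallel_simd(int n, const double *b, const double *c)
{
    double sum = 0.0;
    #pragma omp parallel for simd reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += b[i] * c[i];
    }
    return sum;
}
```

Because floating-point addition is not associative, the result of a parallel reduction can differ in the last bits from a serial sum for general inputs.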

It is important to note that compilers generally analyze loops (even those undecorated with omp simd) to determine if they can be executed with SIMD instructions; applying this OpenMP construct usually allows the compiler to skip its loop dependency checks and immediately generate a SIMD version of the loop. Consequently, improper use of omp simd, e.g., on a loop which indeed carries dependencies between iterations, can generate wrong code. This construct shifts the burden of correctness from the compiler to the user.

For example, consider the following loop, which has a read-after-write (flow) dependency: each iteration reads a(i-1), the value written by the previous iteration.

do i = 1, array_size
  a(i) = b(i) * a(i-1)
end do

Attempting to compile it without the simd construct yields the following optimization report:

LOOP BEGIN at main.f90(8,3)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed FLOW dependence between a(i) (9:5) and a(i-1) (9:5)
LOOP END

The compiler has determined that the loop iterations cannot be executed in SIMD. However, if we introduce the simd construct, this assures the compiler (incorrectly) that the loop iterations can be executed in SIMD. Using the construct results in the following report:

LOOP BEGIN at main.f90(9,3)
   remark #15388: vectorization support: reference A(i) has aligned access   [ main.f90(10,5) ]
   remark #15388: vectorization support: reference B(i) has aligned access   [ main.f90(10,12) ]
   remark #15389: vectorization support: reference A(i-1) has unaligned access   [ main.f90(10,19) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15305: vectorization support: vector length 32
   remark #15399: vectorization support: unroll factor set to 2
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 1
   remark #15449: unmasked aligned unit stride stores: 1
   remark #15450: unmasked unaligned unit stride loads: 1
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 6
   remark #15477: vector cost: 0.340
   remark #15478: estimated potential speedup: 17.450
   remark #15488: --- end vector cost summary ---
LOOP END

This example illustrates the prescriptive nature of OpenMP directives; they allow the user to instruct the compiler precisely how on-node parallelism should be expressed, even if the compiler's own correctness-checking heuristics indicate that the desired approach will generate incorrect results.
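
To see why the vectorized version of this recurrence can produce wrong answers, the C sketch below emulates it without any OpenMP at all. It is an illustration under stated assumptions, not what a particular compiler emits: LANES is a hypothetical vector width of 4, and the emulation assumes that every lane loads a(i-1) before any lane stores a(i). Real results depend on the actual vector length and unrolling.

```c
#include <stddef.h>

#define LANES 4  /* hypothetical vector width, chosen for illustration */

/* Reference: the recurrence executed strictly in order. */
void recurrence_serial(size_t n, double *a, const double *b)
{
    for (size_t i = 1; i < n; ++i) {
        a[i] = b[i] * a[i - 1];
    }
}

/* Emulation of a LANES-wide vectorized loop: all lanes load a[i-1]
   first, then all lanes store a[i], so every lane except the first
   reads a value that has not been updated yet. */
void recurrence_vector_emulated(size_t n, double *a, const double *b)
{
    for (size_t i = 1; i < n; i += LANES) {
        double loaded[LANES];
        size_t width = (n - i < LANES) ? (n - i) : LANES;
        for (size_t l = 0; l < width; ++l) {
            loaded[l] = a[i + l - 1];          /* vector load of a(i-1) */
        }
        for (size_t l = 0; l < width; ++l) {
            a[i + l] = b[i + l] * loaded[l];   /* multiply and vector store */
        }
    }
}
```

With a[0] = 1, the remaining elements of a set to 0, and every element of b set to 2, the serial version produces successive powers of two, while the emulated vector version updates only a[1] correctly and leaves the later elements at zero.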

omp target

The OpenMP target construct maps variables to a device data environment and executes the enclosed region on that device. A region enclosed by the target construct is assigned a target task to be executed on the device. The construct accepts several clauses that give the user control over which data is moved to and from the device. Specifically, data movement is expressed with the map clause, which accepts a list of variables to be copied between the host and device.

Consider the following snippet:

!$omp target map(to:b,c) map(from:a)
do i = 1, array_size
  a(i) = b(i) * c(i)
end do
!$omp end target

The compiler report for the above code, offloaded to an Intel Xeon Phi coprocessor, is as follows:

    Report from: Offload optimizations [offload]

OFFLOAD:main(8,9):  Offload to target MIC 1
 Evaluate length/align/alloc_if/free_if/alloc/into expressions
   Modifier expression assigned to __offload_free_if.19
   Modifier expression assigned to __offload_alloc_if.20
   Modifier expression assigned to __offload_free_if.21
   Modifier expression assigned to __offload_alloc_if.22
   Modifier expression assigned to __offload_free_if.23
   Modifier expression assigned to __offload_alloc_if.24
 Data sent from host to target
       i, scalar size 4 bytes
       __offload_stack_ptr_main_$C_V$5.0, pointer to array reference expression with base
       __offload_stack_ptr_main_$B_V$6.0, pointer to array reference expression with base
 Data received by host from target
       __offload_stack_ptr_MAIN__.34, pointer to array reference expression with base 

LOOP BEGIN at main.f90(12,3)
   remark #15388: vectorization support: reference A(i) has aligned access   [ main.f90(13,5) ]
   remark #15389: vectorization support: reference B(i) has unaligned access   [ main.f90(13,12) ]
   remark #15389: vectorization support: reference C(i) has unaligned access   [ main.f90(13,19) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15305: vectorization support: vector length 32
   remark #15399: vectorization support: unroll factor set to 2
   remark #15309: vectorization support: normalized vectorization overhead 0.654
   remark #15300: LOOP WAS VECTORIZED
   remark #15449: unmasked aligned unit stride stores: 1
   remark #15450: unmasked unaligned unit stride loads: 2
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 7
   remark #15477: vector cost: 0.400
   remark #15478: estimated potential speedup: 17.180
   remark #15488: --- end vector cost summary ---
   remark #25015: Estimate of max trip count of loop=1024
LOOP END

The same code offloaded to an NVIDIA Tesla GPU shows the following compiler report (from a different compiler than the ones shown above):

    1.           program main
    2.             implicit none
    3.
    4.             integer, parameter :: array_size = 65536
    5.             real, dimension(array_size) :: a, b, c
    6.             integer :: i
    7.
    8.    fA--<>   b(:) = 1.0
    9.    f---<>   c(:) = 2.0
   10.
   11.  + G----<   !$omp target map(to:b,c) map(from:a)
   12.    G g--<   do i = 1, array_size
   13.    G g        a(i) = b(i) * c(i)
   14.    G g-->   end do
   15.    G---->   !$omp end target
   16.
   17.             print *, a(1)
   18.
   19.           end program main

ftn-6230 ftn: VECTOR MAIN, File = main.f90, Line = 8
  A loop starting at line 8 was replaced with multiple library calls.

ftn-6004 ftn: SCALAR MAIN, File = main.f90, Line = 9
  A loop starting at line 9 was fused with the loop starting at line 8.

ftn-6405 ftn: ACCEL MAIN, File = main.f90, Line = 11
  A region starting at line 11 and ending at line 15 was placed on the accelerator.

ftn-6418 ftn: ACCEL MAIN, File = main.f90, Line = 11
  If not already present: allocate memory and copy whole array "c" to accelerator, free at line 15 (acc_copyin).

ftn-6418 ftn: ACCEL MAIN, File = main.f90, Line = 11
  If not already present: allocate memory and copy whole array "b" to accelerator, free at line 15 (acc_copyin).

ftn-6420 ftn: ACCEL MAIN, File = main.f90, Line = 11
  If not already present: allocate memory for whole array "a" on accelerator, copy back at line 15 (acc_copyout).

ftn-6430 ftn: ACCEL MAIN, File = main.f90, Line = 12
  A loop starting at line 12 was partitioned across the 128 threads within a threadblock.

Note in the last compiler report that the compiler automatically threads the loop and partitions the threads into threadblocks of the appropriate size for the device executing the loop.
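
Rather than relying on a compiler to thread the offloaded loop automatically, the parallelism can be requested explicitly with the combined target teams distribute parallel for construct. The C sketch below is an illustration, and the function name multiply_target is hypothetical. Unlike the Fortran arrays above, C pointers carry no extent, so the map clauses use array sections such as b[0:n]. When no device is available, or when OpenMP is not enabled at compile time, the loop runs on the host and gives the same result.

```c
/* a[i] = b[i] * c[i], offloaded to the default device.
   map(to: ...)   copies b and c to the device before the region runs;
   map(from: ...) copies a back to the host when the region ends.
   'teams distribute parallel for' spreads the iterations over teams
   of threads on the device. */
void multiply_target(int n, float *a, const float *b, const float *c)
{
    #pragma omp target teams distribute parallel for \
        map(to: b[0:n], c[0:n]) map(from: a[0:n])
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] * c[i];
    }
}
```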

Benefits and Challenges

Benefits

  • Available for many different languages
  • Prescriptive control of execution
  • Allows performance optimization
  • Controlled by well-defined standards bodies

Challenges

  • Sensitive to compiler support/maturity
  • Evolving standards

Compiler Support

The OpenMP project maintains a table of compiler support for different features: OpenMP Compiler Support.