OpenACC¶
OpenACC is a set of standardized, high-level compiler directives (pragmas in C/C++, directive comments in Fortran) that enable C/C++ and Fortran programmers to exploit parallel (co)processors, especially GPUs. OpenACC directives annotate code to control data placement, data transfer, and loop or code block parallelism.
Though OpenACC has much in common with OpenMP, the syntax of the directives is different. More importantly, OpenACC is best described as having a descriptive model, in contrast to the more prescriptive model presented by OpenMP. This difference in philosophy is most readily seen by comparing the acc loop directive to the OpenMP implementation of the equivalent construct. In OpenMP, the programmer is responsible for specifying how the parallelism in a loop is distributed (e.g., via the distribute construct and the schedule clause). In OpenACC, the compiler and runtime determine how to decompose the iterations across gangs, workers, and vectors unless the programmer says otherwise. At an even higher level, an OpenACC programmer can use the acc kernels construct to give the compiler complete freedom to map the available parallelism in a code block to the available hardware.
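To make the contrast concrete, the sketch below writes the same scaling loop three ways. The function names (`scale_omp`, `scale_acc`, `scale_kernels`) are hypothetical and chosen only for illustration, and the OpenMP clause choices are one plausible way to offload the loop, not the only one. A compiler without OpenMP or OpenACC support enabled simply ignores the pragmas and runs the loops serially.

```c
/* Prescriptive (OpenMP): the programmer states how iterations are spread
   over teams and threads and how they are scheduled. */
void scale_omp(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma omp target teams distribute parallel for schedule(static) \
        map(to: x[0:n]) map(from: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i];
}

/* Descriptive (OpenACC): the programmer asserts the loop is parallel;
   the mapping onto gangs, workers, and vectors is left to the compiler. */
void scale_acc(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i];
}

/* Higher level still: kernels lets the compiler analyze the block and
   decide what to parallelize and how. */
void scale_kernels(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc kernels copyin(x[0:n]) copyout(y[0:n])
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i];
    }
}
```

The `restrict` qualifiers tell the compiler that `x` and `y` do not alias, which is typically what the `kernels` version needs before it can prove the loop safe to parallelize.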
OpenACC at a glance¶
Some of the most important data and control clauses for two of the most used constructs in OpenACC programming, $acc parallel and $acc kernels, are listed below. The data placement and movement clauses also appear in $acc data constructs. $acc loop controls parallelism in a similar way to $acc parallel, but at the level of an individual loop.
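As a rough illustration of loop-level control, the sketch below places `acc loop` directives inside a single `acc parallel` region to map an outer loop to gangs and an inner loop to vector lanes. The function `row_sums` and its row-major array layout are assumptions made for this example.

```c
/* Sum each row of a rows x cols matrix stored in row-major order. */
void row_sums(int rows, int cols, const float *restrict a, float *restrict out)
{
    #pragma acc parallel copyin(a[0:rows*cols]) copyout(out[0:rows])
    {
        /* Outer loop: rows are distributed across gangs. */
        #pragma acc loop gang
        for (int i = 0; i < rows; ++i) {
            float s = 0.0f;

            /* Inner loop: columns are spread over vector lanes, and the
               per-lane partial sums are combined by the reduction. */
            #pragma acc loop vector reduction(+:s)
            for (int j = 0; j < cols; ++j)
                s += a[i * cols + j];

            out[i] = s;
        }
    }
}
```

Omitting the `gang` and `vector` clauses would also be valid; the compiler would then choose the mapping itself, in keeping with the descriptive model.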
Much more detail can be found at:
- OLCF Accelerator Programming Tutorials (includes examples of interoperability with CUDA and GPU libraries like cuFFT)
| construct | important clauses | description |
|---|---|---|
| $acc parallel | num_gangs(expression) | Controls how many parallel gangs are created |
| | num_workers(expression) | Controls how many workers are created in each gang |
| | vector_length(expression) | Controls the vector length of each worker |
| | private(list) | A copy of each variable in list is allocated to each gang |
| | firstprivate(list) | Like private, but each copy is initialized with the value from the host |
| | reduction(operator:list) | Each gang gets a private copy; the copies are combined with operator across gangs |
| $acc kernels | copy(list) | Allocates memory on GPU, copies data from host to GPU when entering region, and copies data back to the host when exiting region |
| | copyin(list) | Allocates memory on GPU and copies data from host to GPU when entering region |
| | copyout(list) | Allocates memory on GPU and copies data to the host when exiting region |
| | create(list) | Allocates memory on GPU but does not copy |
| | present(list) | Data is already present on GPU from another containing data region |
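The following sketch shows how several of these clauses might combine in practice: an `acc data` region moves the arrays once, and two compute regions inside it use `present` to reuse them. The function name `dot_scaled` is hypothetical, and the values passed to `num_gangs` and `vector_length` are arbitrary placeholders rather than tuned settings.

```c
/* Computes y = a*x + y on the device, then returns dot(x, y). */
double dot_scaled(int n, double a, const double *restrict x, double *restrict y)
{
    double sum = 0.0;

    /* copyin: x is only read on the device.
       copy:   y is copied in on entry and back to the host on exit. */
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        /* present: both arrays are already on the device thanks to the
           enclosing data region, so no transfer happens here. */
        #pragma acc parallel loop present(x[0:n], y[0:n]) \
            num_gangs(64) vector_length(128)
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];

        /* reduction: each gang accumulates a private partial sum, and the
           partial sums are added together when the region ends. */
        #pragma acc parallel loop present(x[0:n], y[0:n]) reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += x[i] * y[i];
    }

    return sum;
}
```

Keeping both loops inside one data region avoids copying `x` and `y` between host and GPU twice, which is usually the main reason to use `acc data` with `present`.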
How to use OpenACC on ASCR facilities¶
OLCF¶
Using C/C++¶
PGI Compiler
$ module load cudatoolkit
$ cc -acc vecAdd.c -o vecAdd.out
Cray Compiler
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -h pragma=acc vecAdd.c -o vecAdd.out
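The contents of vecAdd.c are not shown here. As a minimal sketch of what the core of such a file might look like, the example below adds two vectors inside an `acc kernels` region and checks the result on the host. The names `vec_add` and `vec_add_self_test` are assumptions for illustration and are not taken from the actual tutorial source.

```c
#include <stdlib.h>

/* c = a + b, offloaded with OpenACC when the compiler supports it. */
void vec_add(int n, const float *restrict a, const float *restrict b,
             float *restrict c)
{
    #pragma acc kernels copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Fills a and b with known values, runs vec_add, and verifies c on the
   host. Returns 0 on success and 1 on allocation failure or mismatch. */
int vec_add_self_test(int n)
{
    float *a = malloc((size_t)n * sizeof *a);
    float *b = malloc((size_t)n * sizeof *b);
    float *c = malloc((size_t)n * sizeof *c);
    int bad = 0;

    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return 1;
    }

    for (int i = 0; i < n; ++i) {
        a[i] = (float)i;
        b[i] = 2.0f * (float)i;
    }

    vec_add(n, a, b, c);

    for (int i = 0; i < n; ++i)
        if (c[i] != 3.0f * (float)i)
            bad = 1;

    free(a); free(b); free(c);
    return bad;
}
```

Adding `int main(void) { return vec_add_self_test(1000); }` would turn this into a complete program that can be built with either of the compile commands above.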
Using Fortran¶
PGI Compiler
$ module load cudatoolkit
$ ftn -acc vecAdd.f90 -o vecAdd.out
Cray Compiler
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -h acc vecAdd.f90 -o vecAdd.out
Benefits and Challenges¶
Benefits¶
- Available for C, C++, and Fortran
- Interoperable with other approaches (e.g. CUDA or OpenMP)
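- Allows incremental performance optimization (e.g., by adding clauses such as num_gangs, vector_length, or explicit data movement to existing directives)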
- Controlled by well-defined standards bodies
Challenges¶
- Relatively few compiler implementations at present (versus OpenMP)
- Evolving standards
- Descriptive approach sometimes impedes very high performance for a given kernel