OpenACC is a set of standardized, high-level pragmas that enable C/C++ and Fortran programmers to exploit parallel (co)processors, especially GPUs. OpenACC pragmas can be used to annotate codes to enable data location, data transfer, and loop or code block parallelism.
Though OpenACC has much in common with OpenMP, the syntax of the directives is different.
More importantly, OpenACC can best be described as having
a descriptive model, in contrast to the more prescriptive model presented by OpenMP.
This difference in philosophy can most readily be seen by, e.g., comparing the
acc loop directive
to the OpenMP implementation of the equivalent construct. In OpenMP, the programmer has responsibility
to specify how the parallelism in a loop is distributed (e.g., via
In OpenACC, the runtime determines how to decompose the iterations across gangs or workers and vectors.
At an even higher level, an OpenACC programmer can use the
acc kernels construct to allow the compiler complete freedom
to map the available parallelism in a code block to the available hardware.
OpenACC at a glance¶
Some of the most important data and control clauses for two of the most
used constructs in OpenACC programming -
$acc parallel and
$acc kernels - are
listed below. The data placement and movement clauses also appear in
$acc data constructs.
$acc loop provides control of parallelism similarly to
$acc parallel but provides loop-level control.
Much more detail can be found at:
OLCF Accelerator Programming Tutorials (includes examples of interoperability with CUDA and GPU libraries like CuFFT)
|num_gangs(expression)||Controls how many parallel gangs are created|
|num_workers(expression)||Controls how many workers are created in each gang|
|vector_length(list)||Controls vector length of each worker|
|private(list)||A copy of each variable in list is allocated to each gang|
|firstprivate(list)||private variables initialized from host|
|reduction(operator:list)||private variables combined across gangs|
|copy(list)||Allocates memory on GPU and copies data from host to GPU when entering region and copies data to the host when exiting region|
|copyin(list)||Allocates memory on GPU and copies data from host to GPU when entering region|
|copyout(list)||Allocates memory on GPU and copies data to the host when exiting region|
|create(list)||Allocates memory on GPU but does not copy|
|present(list)||Data is already present on GPU from another containing data region|
How to use OpenACC on ASCR facilities¶
$ module load cudatoolkit $ cc -acc vecAdd.c -o vecAdd.out
$ module switch PrgEnv-pgi PrgEnv-cray $ module load craype-accel-nvidia35 $ cc -h pragma=acc vecAdd.c -o vecAdd.out
$ module load cudatoolkit $ ftn -acc vecAdd.f90 -o vecAdd.out
$ module switch PrgEnv-pgi PrgEnv-cray $ module load craype-accel-nvidia35 $ ftn -h acc vecAdd.f90 -o vecAdd.out
Benefits and Challenges¶
- Available for many different languages
- Interoperable with other approaches (e.g. CUDA or OpenMP)
- Allows performance optimization
- Controlled by well-defined standards bodies
- Relatively few compiler implementations at present (versus OpenMP)
- Evolving standards
- Descriptive approach sometimes impedes very high performance for a given kernel