OpenMP Support¶
Clang fully supports OpenMP 4.5. Clang supports offloading to X86_64, AArch64, PPC64[LE] and has basic support for Cuda devices.
#pragma omp declare simd: Partial. Parsing, semantic analysis, and generation of special attributes for the X86 target are supported, but the LLVM pass for vectorization is still missing.
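As a minimal sketch of what the directive looks like in user code, the following C fragment marks a function with declare simd and calls it from a simd loop. The function names scale_add and saxpy are hypothetical and chosen only for this illustration; whether a vector variant is actually used depends on the compiler version and target, as noted above.

```c
#include <assert.h>
#include <stddef.h>

/* Request SIMD variants of this function. `a` is the same for every lane
   (uniform), and the function is never called under a condition
   (notinbranch). Without -fopenmp or -fopenmp-simd the pragma is ignored
   and only the scalar version exists. */
#pragma omp declare simd uniform(a) notinbranch
float scale_add(float x, float y, float a)
{
    return a * x + y;
}

/* Hypothetical caller: applies scale_add element-wise inside a simd loop. */
void saxpy(float *out, const float *x, const float *y, float a, size_t n)
{
    assert(n == 0 || (out && x && y));
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        out[i] = scale_add(x[i], y[i], a);
}
```

The code behaves identically with and without OpenMP enabled; the pragmas only give the compiler permission to vectorize.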
In addition, the LLVM OpenMP runtime libomp supports the OpenMP Tools Interface (OMPT) on x86, x86_64, AArch64, and PPC64 on Linux, Windows, and macOS.
For the list of supported features from OpenMP 5.0 see OpenMP implementation details.
General improvements¶
New collapse clause scheme to avoid expensive remainder operations: after a loop nest is collapsed via the collapse clause, the loop index variables are computed with multiplications and additions instead of the expensive remainder operation.
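The arithmetic behind this can be sketched as follows. This is only an illustration of the identity involved for a two-level nest, not the code Clang actually emits; the type index_pair and both function names are hypothetical.

```c
#include <assert.h>

/* Indices (i, j) of a two-level loop nest, recovered from a collapsed
   iteration number iv in [0, outer * inner). */
typedef struct { long i; long j; } index_pair;

/* Straightforward form: one division and one remainder. */
index_pair split_with_remainder(long iv, long inner)
{
    assert(iv >= 0 && inner > 0);
    index_pair p = { iv / inner, iv % inner };
    return p;
}

/* Remainder-free form: reuse the quotient, so j = iv - i * inner needs
   only a multiplication and a subtraction. */
index_pair split_without_remainder(long iv, long inner)
{
    assert(iv >= 0 && inner > 0);
    index_pair p;
    p.i = iv / inner;
    p.j = iv - p.i * inner;
    return p;
}
```

Both functions return the same pair for every valid iv; the second simply avoids the separate remainder operation.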
The default schedules for the distribute and for constructs in a parallel region and in SPMD mode have changed to ensure coalesced accesses. For the distribute construct, a static schedule is used with a chunk size equal to the number of threads per team (default value of threads or as specified by the thread_limit clause if present). For the for construct, the schedule is static with chunk size of one.
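To see why a static schedule with chunk size one leads to coalesced accesses, the sketch below models which thread owns each iteration under schedule(static, chunk), where chunks are handed out round-robin in thread order. The helper static_owner and the function scale are hypothetical names for this illustration; scale spells out explicitly the schedule that the text describes as the new default for the for construct.

```c
#include <assert.h>

/* Thread that executes iteration `iter` under schedule(static, chunk)
   with `nthreads` threads: chunk k goes to thread k % nthreads. */
int static_owner(long iter, long chunk, int nthreads)
{
    assert(iter >= 0 && chunk > 0 && nthreads > 0);
    return (int)((iter / chunk) % nthreads);
}

/* With chunk size 1, iterations i and i + 1 belong to neighbouring
   threads, so neighbouring threads touch a[i] and a[i + 1]. */
void scale(float *a, long n, float s)
{
    #pragma omp target teams distribute parallel for \
        map(tofrom: a[0:n]) schedule(static, 1)
    for (long i = 0; i < n; ++i)
        a[i] *= s;
}
```

For example, with four threads and chunk size one, iterations 0, 1, 2, 3 go to threads 0, 1, 2, 3, and iteration 4 wraps around to thread 0.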
Simplified SPMD code generation for distribute parallel for when the new default schedules are applicable.
When the collapse clause is used on a loop nest, the default behavior is to automatically extend the representation of the loop counter to 64 bits in cases where the sizes of the collapsed loops are not known at compile time. To prevent this conservative choice and use at most 32 bits, compile your program with the -fopenmp-optimistic-collapse option.
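The reason for the conservative default is that the collapsed trip count is the product of the individual trip counts, which can exceed 32 bits even when each bound fits. The following sketch checks that condition for a two-level nest, assuming a signed 32-bit counter for the optimistic case; collapsed_count_fits_32 and fill are hypothetical names used only here.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the collapsed trip count n * m fits in a signed 32-bit
   counter (an assumption made for this illustration), 0 otherwise. */
int collapsed_count_fits_32(int32_t n, int32_t m)
{
    assert(n >= 0 && m >= 0);
    int64_t total = (int64_t)n * (int64_t)m;
    return total <= INT32_MAX;
}

/* Loop nest whose bounds are only known at run time, so the size of the
   collapsed iteration space is unknown to the compiler. */
void fill(float *a, int n, int m, float v)
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            a[(long)i * m + j] = v;
}
```

A nest such as fill is only a candidate for -fopenmp-optimistic-collapse when the caller can guarantee that n * m stays within 32 bits, for example 1000 by 1000 but not 100000 by 100000.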
Cuda devices support¶
Directives execution modes¶
Clang code generation for target regions supports two modes: SPMD and non-SPMD. Clang chooses one of these two modes automatically based on how directives and their clauses are used. The SPMD mode uses a simplified set of runtime functions, which increases performance at the cost of not supporting some OpenMP features. The non-SPMD mode is the most generic mode and supports all currently available OpenMP features. The compiler will always attempt to use the SPMD mode wherever possible. SPMD mode will not be used if:
The target region contains user code (other than OpenMP-specific directives) in between the target and the parallel directives.
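The condition above can be illustrated with two target regions. The first uses a combined construct with nothing between target and parallel; the second runs user code before the parallel region. The function names are hypothetical, and which mode Clang actually selects for a given region may depend on the compiler version, so treat this as a sketch of the rule rather than a guarantee.

```c
#include <assert.h>

/* Only OpenMP directives sit between target and parallel, so this region
   is a candidate for SPMD mode. */
void add_spmd(float *c, const float *a, const float *b, int n)
{
    assert(n >= 0);
    #pragma omp target teams distribute parallel for \
        map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* User code (the computation of `offset`) runs between the target and
   parallel directives, which is the case listed above where SPMD mode
   is not used. */
void add_generic(float *c, const float *a, const float *b, int n, float bias)
{
    assert(n >= 0);
    #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
    {
        float offset = bias * 2.0f;
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i] + offset;
    }
}
```

Moving the computation of offset outside the target region, when that is possible, would restore the shape of the first function.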
Data-sharing modes¶
Clang supports two data-sharing models for Cuda devices: Generic and Cuda modes. The default mode is Generic. Cuda mode can give additional performance and can be activated using the -fopenmp-cuda-mode flag. In Generic mode, all local variables that can be shared in the parallel regions are stored in the global memory. In Cuda mode, local variables are not shared between the threads, and it is the user's responsibility to share the required data between the threads in the parallel regions.
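A minimal sketch of the kind of code affected by this choice: a variable local to the target region is read as a shared variable inside the parallel region. The function sum_with_base is hypothetical. In Generic mode this pattern is handled by the compiler as described above; under -fopenmp-cuda-mode it is the pattern the user has to handle explicitly.

```c
#include <assert.h>

/* `base` is local to the target region and is implicitly shared by all
   threads of the inner parallel region. */
int sum_with_base(const int *x, int n, int base_in)
{
    assert(n >= 0);
    int total = 0;
    #pragma omp target map(to: x[0:n]) map(tofrom: total)
    {
        int base = base_in + 1;   /* local variable of the target region */
        #pragma omp parallel for reduction(+: total)
        for (int i = 0; i < n; ++i)
            total += x[i] + base; /* every thread reads the shared `base` */
    }
    return total;
}
```

One way to avoid relying on sharing, offered here as a suggestion rather than something the source prescribes, is to pass the value into the parallel region with firstprivate(base) so that each thread gets its own copy.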
Features not supported or with limited support for Cuda devices¶
Cancellation constructs are not supported.
Doacross loop nest is not supported.
User-defined reductions are supported only for trivial types.
Nested parallelism: inner parallel regions are executed sequentially.
Automatic translation of math functions in target regions to device-specific math functions is not implemented yet.
Debug information for OpenMP target regions is supported, but it may sometimes be necessary to manually specify the address class of the inspected variables. In some cases the local variables are actually allocated in the global memory, but the debug info may not be aware of it.
OpenMP 5.0 Implementation Details¶
The following table provides a quick overview of various OpenMP 5.0 features and their implementation status. Please post on the Discourse forums (Runtimes - OpenMP category) for more information or if you want to help with the implementation.
Category | Feature | Status | Reviews |
---|---|---|---|
loop | support != in the canonical loop form | done | D54441 |
loop | #pragma omp loop (directive) | worked on | |
loop | collapse imperfectly nested loop | done | |
loop | collapse non-rectangular nested loop | done | |
loop | C++ range-based for loop | done | |
loop | clause: if for SIMD directives | done | |
loop | inclusive scan (matching C++17 PSTL) | done | |
memory management | memory allocators | done | r341687,r357929 |
memory management | allocate directive and allocate clause | done | r355614,r335952 |
OMPD | OMPD interfaces | not upstream | https://github.com/OpenMPToolsInterface/LLVM-openmp/tree/ompd-tests |
OMPT | OMPT interfaces | mostly done | |
thread affinity | thread affinity | done | |
task | taskloop reduction | done | |
task | task affinity | not upstream | |
task | clause: depend on the taskwait construct | mostly done | D113540 (regular codegen only) |
task | depend objects and detachable tasks | done | |
task | mutexinoutset dependence-type for tasks | done | D53380,D57576 |
task | combined taskloop constructs | done | |
task | master taskloop | done | |
task | parallel master taskloop | done | |
task | master taskloop simd | done | |
task | parallel master taskloop simd | done | |
SIMD | atomic and simd constructs inside SIMD code | done | |
SIMD | SIMD nontemporal | done | |
device | infer target functions from initializers | worked on | |
device | infer target variables from initializers | worked on | |
device | OMP_TARGET_OFFLOAD environment variable | done | D50522 |
device | support full ‘defaultmap’ functionality | done | D69204 |
device | device specific functions | done | |
device | clause: device_type | done | |
device | clause: extended device | done | |
device | clause: uses_allocators | done | |
device | clause: in_reduction | worked on | r308768 |
device | omp_get_device_num() | worked on | D54342 |
device | structure mapping of references | unclaimed | |
device | nested target declare | done | D51378 |
device | implicitly map ‘this’ (this[:1]) | done | D55982 |
device | allow access to the reference count (omp_target_is_present) | done | |
device | requires directive | partial | |
device | clause: unified_shared_memory | done | D52625,D52359 |
device | clause: unified_address | partial | |
device | clause: reverse_offload | unclaimed parts | D52780 |
device | clause: atomic_default_mem_order | done | D53513 |
device | clause: dynamic_allocators | unclaimed parts | D53079 |
device | user-defined mappers | worked on | D56326,D58638,D58523,D58074,D60972,D59474 |
device | mapping lambda expression | done | D51107 |
device | clause: use_device_addr for target data | done | |
device | support close modifier on map clause | done | D55719,D55892 |
device | teams construct on the host device | done | r371553 |
device | support non-contiguous array sections for target update | done | |
device | pointer attachment | unclaimed | |
device | map clause reordering based on map types | unclaimed | |
atomic | hints for the atomic construct | done | D51233 |
base language | C11 support | done | |
base language | C++11/14/17 support | done | |
base language | lambda support | done | |
misc | array shaping | done | D74144 |
misc | library shutdown (omp_pause_resource[_all]) | unclaimed parts | D55078 |
misc | metadirectives | worked on | D91944 |
misc | conditional modifier for lastprivate clause | done | |
misc | iterator and multidependences | done | |
misc | depobj directive and depobj dependency kind | done | |
misc | user-defined function variants | worked on | D67294, D64095, D71847, D71830, D109635 |
misc | pointer/reference to pointer based array reductions | unclaimed | |
misc | prevent new type definitions in clauses | done | |
memory model | memory model update (seq_cst, acq_rel, release, acquire,…) | done | |
OpenMP 5.1 Implementation Details¶
The following table provides a quick overview of various OpenMP 5.1 features and their implementation status, as defined in Technical Report 8 (TR8). Please post on the Discourse forums (Runtimes - OpenMP category) for more information or if you want to help with the implementation.
Category | Feature | Status | Reviews |
---|---|---|---|
atomic | ‘compare’ clause on atomic construct | done | D120290, D120007, D118632, D120200, D116261, D118547, D116637 |
atomic | ‘fail’ clause on atomic construct | worked on | |
base language | C++ attribute specifier syntax | done | D105648 |
device | ‘present’ map type modifier | done | D83061, D83062, D84422 |
device | ‘present’ motion modifier | done | D84711, D84712 |
device | ‘present’ in defaultmap clause | done | D92427 |
device | map clause reordering based on ‘present’ modifier | unclaimed | |
device | device-specific environment variables | unclaimed | |
device | omp_target_is_accessible routine | unclaimed | |
device | omp_get_mapped_ptr routine | unclaimed | |
device | new async target memory copy routines | unclaimed | |
device | thread_limit clause on target construct | unclaimed | |
device | has_device_addr clause on target construct | unclaimed | |
device | iterators in map clause or motion clauses | worked on | |
device | indirect clause on declare target directive | unclaimed | |
device | allow virtual function calls for mapped object on device | unclaimed | |
device | interop construct | partial | parsing/sema done: D98558, D98834, D98815 |
device | assorted routines for querying interoperable properties | unclaimed | |
loop | Loop tiling transformation | done | D76342 |
loop | Loop unrolling transformation | done | D99459 |
loop | ‘reproducible’/‘unconstrained’ modifiers in ‘order’ clause | partial | D127855 |
memory management | alignment for allocate directive and clause | worked on | |
memory management | new memory management routines | unclaimed | |
memory management | changes to omp_alloctrait_key enum | unclaimed | |
memory model | seq_cst clause on flush construct | unclaimed | |
misc | ‘omp_all_memory’ keyword and use in ‘depend’ clause | done | D125828, D126321 |
misc | error directive | unclaimed | |
misc | scope construct | unclaimed | |
misc | routines for controlling and querying team regions | unclaimed | |
misc | changes to ompt_scope_endpoint_t enum | unclaimed | |
misc | omp_display_env routine | unclaimed | |
misc | extended OMP_PLACES syntax | unclaimed | |
misc | OMP_NUM_TEAMS and OMP_TEAMS_THREAD_LIMIT env vars | unclaimed | |
misc | ‘target_device’ selector in context specifier | unclaimed | |
misc | begin/end declare variant | done | D71179 |
misc | dispatch construct and function variant argument adjustment | worked on | D99537, D99679 |
misc | assume and assumes directives | worked on | |
misc | nothing directive | worked on | |
misc | masked construct and related combined constructs | worked on | D99995, D100514 |
misc | default(firstprivate) & default(private) | partial | firstprivate done: D75591 |
other | deprecating master construct | unclaimed | |
OMPT | new barrier types added to ompt_sync_region_t enum | unclaimed | |
OMPT | async data transfers added to ompt_target_data_op_t enum | unclaimed | |
OMPT | new barrier state values added to ompt_state_t enum | unclaimed | |
OMPT | new ‘emi’ callbacks for external monitoring interfaces | unclaimed | |
task | ‘strict’ modifier for taskloop construct | unclaimed | |
task | inoutset in depend clause | unclaimed | |
task | nowait clause on taskwait | worked on | |
OpenMP Extensions¶
The following table provides a quick overview of various OpenMP extensions and their implementation status. These extensions are not currently defined by any standard, so links to associated LLVM documentation are provided. As these extensions mature, they will be considered for standardization. Please post on the Discourse forums (Runtimes - OpenMP category) to provide feedback.
Category | Feature | Status | Reviews |
---|---|---|---|
atomic extension | | prototyped | D126323 |
device extension | | prototyped | D106509, D106510 |