[OpenMP] Change build of OpenMP device runtime to be a separate runtime #136729

Open
jhuber6 wants to merge 1 commit into base: main

jhuber6 (Contributor) commented Apr 22, 2025

Summary:
Currently we build the OpenMP device runtime as part of the offload/
project. This is problematic because the device runtime has several
restrictions that the normal offloading runtime does not: it can only be
built with an up-to-date clang, and the target must be set appropriately.
Currently we hack around this by creating the compiler invocation
manually; this patch moves it into a separate runtimes build instead.

This follows the same build we use for libc, libc++, compiler-rt, and
flang-rt. It also moves the device runtime from offload/ into openmp/
because it is still the OpenMP runtime and I feel that is the more
appropriate place. We do want a generic offload/ library at some point,
but adding it later as a separate library would be trivial now that the
infrastructure is in place.

Most importantly, this will require users to update their build
configs, adding the following lines at a minimum. I debated whether I
should 'auto-upgrade' this, but I just went with a warning.

    -DLLVM_RUNTIME_TARGETS='default;amdgcn-amd-amdhsa;nvptx64-nvidia-cuda' \
    -DRUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES=openmp \
    -DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES=openmp

This also changes where the .bc version of the library lives, but it is
still created.
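
For readers who keep their configuration in a CMake cache file rather than on the command line, the same settings could be written roughly as in the sketch below. Only the three device-related variables come from this patch. The file name `MyOffload.cmake` and the `LLVM_ENABLE_PROJECTS` / `LLVM_ENABLE_RUNTIMES` values are assumptions about a typical monorepo build and should be adapted to the local setup.

```cmake
# MyOffload.cmake (hypothetical file name), pre-loaded with `cmake -C`.
# Assumed host-side setup: build clang and lld, plus the host OpenMP and
# offload runtimes. Adjust these two lists to match your own build.
set(LLVM_ENABLE_PROJECTS "clang;lld" CACHE STRING "")
set(LLVM_ENABLE_RUNTIMES "openmp;offload" CACHE STRING "")

# Settings taken from the description above: add the two GPU triples as
# runtime targets and enable openmp for each, so the device runtime is
# built as its own runtimes build with the just-built clang.
set(LLVM_RUNTIME_TARGETS "default;amdgcn-amd-amdhsa;nvptx64-nvidia-cuda" CACHE STRING "")
set(RUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES "openmp" CACHE STRING "")
set(RUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES "openmp" CACHE STRING "")
```

Such a file would be passed at configure time with `cmake -C MyOffload.cmake` followed by the usual source and build directory arguments. The in-tree offload/cmake/caches/Offload.cmake touched by this patch takes the same approach and additionally enables compiler-rt, libc, libcxx, and libcxxabi for the GPU targets.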

llvmbot added the clang, clang:driver, openmp:libomp, openmp:libomptarget, and offload labels Apr 22, 2025
llvmbot (Member) commented Apr 22, 2025

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-offload

@llvm/pr-subscribers-clang

Author: Joseph Huber (jhuber6)

Changes

Patch is 24.72 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/136729.diff

36 Files Affected:

  • (modified) clang/lib/Driver/ToolChains/CommonArgs.cpp (+5)
  • (modified) offload/CMakeLists.txt (+7-1)
  • (removed) offload/DeviceRTL/CMakeLists.txt (-181)
  • (modified) offload/cmake/caches/Offload.cmake (+2-2)
  • (modified) openmp/CMakeLists.txt (+45-31)
  • (added) openmp/device/CMakeLists.txt (+99)
  • (renamed) openmp/device/include/Allocator.h ()
  • (renamed) openmp/device/include/Configuration.h ()
  • (renamed) openmp/device/include/Debug.h ()
  • (renamed) openmp/device/include/DeviceTypes.h ()
  • (renamed) openmp/device/include/DeviceUtils.h ()
  • (renamed) openmp/device/include/Interface.h ()
  • (renamed) openmp/device/include/LibC.h ()
  • (renamed) openmp/device/include/Mapping.h ()
  • (renamed) openmp/device/include/Profiling.h ()
  • (renamed) openmp/device/include/State.h ()
  • (renamed) openmp/device/include/Synchronization.h ()
  • (renamed) openmp/device/include/Workshare.h ()
  • (renamed) openmp/device/include/generated_microtask_cases.gen ()
  • (renamed) openmp/device/src/Allocator.cpp ()
  • (renamed) openmp/device/src/Configuration.cpp ()
  • (renamed) openmp/device/src/Debug.cpp ()
  • (renamed) openmp/device/src/DeviceUtils.cpp ()
  • (renamed) openmp/device/src/Kernel.cpp ()
  • (renamed) openmp/device/src/LibC.cpp ()
  • (renamed) openmp/device/src/Mapping.cpp ()
  • (renamed) openmp/device/src/Misc.cpp ()
  • (renamed) openmp/device/src/Parallelism.cpp ()
  • (renamed) openmp/device/src/Profiling.cpp ()
  • (renamed) openmp/device/src/Reduction.cpp ()
  • (renamed) openmp/device/src/State.cpp ()
  • (renamed) openmp/device/src/Stub.cpp ()
  • (renamed) openmp/device/src/Synchronization.cpp ()
  • (renamed) openmp/device/src/Tasking.cpp ()
  • (renamed) openmp/device/src/Workshare.cpp ()
  • (modified) openmp/docs/SupportAndFAQ.rst (+7)
diff --git a/clang/lib/Driver/ToolChains/CommonArgs.cpp b/clang/lib/Driver/ToolChains/CommonArgs.cpp
index 8646c55060b17..7cc4008ec1f2b 100644
--- a/clang/lib/Driver/ToolChains/CommonArgs.cpp
+++ b/clang/lib/Driver/ToolChains/CommonArgs.cpp
@@ -2794,6 +2794,11 @@ void tools::addOpenMPDeviceRTL(const Driver &D,
   for (const auto &LibPath : HostTC.getFilePaths())
     LibraryPaths.emplace_back(LibPath);
 
+  // Check the target specific library path for the triple as well.
+  SmallString<128> P(D.Dir);
+  llvm::sys::path::append(P, "..", "lib", Triple.getTriple());
+  LibraryPaths.emplace_back(P);
+
   OptSpecifier LibomptargetBCPathOpt =
       Triple.isAMDGCN()  ? options::OPT_libomptarget_amdgpu_bc_path_EQ
       : Triple.isNVPTX() ? options::OPT_libomptarget_nvptx_bc_path_EQ
diff --git a/offload/CMakeLists.txt b/offload/CMakeLists.txt
index 25c879710645c..70ac6a6d1e6c3 100644
--- a/offload/CMakeLists.txt
+++ b/offload/CMakeLists.txt
@@ -113,6 +113,13 @@ else()
   set(CMAKE_CXX_EXTENSIONS NO)
 endif()
 
+# Emit a warning for people who haven't updated their build.
+if(NOT "openmp" IN_LIST RUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES AND
+   NOT "openmp" IN_LIST RUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES)
+  message(WARNING "Building the offloading runtime with no device library. See "
+                  "https://openmp.llvm.org//SupportAndFAQ.html for help.")
+endif()
+
 # Set the path of all resulting libraries to a unified location so that it can
 # be used for testing.
 set(LIBOMPTARGET_LIBRARY_DIR ${CMAKE_CURRENT_BINARY_DIR})
@@ -373,7 +380,6 @@ set(LIBOMPTARGET_LLVM_LIBRARY_INTDIR "${LIBOMPTARGET_INTDIR}" CACHE STRING
 
 # Build offloading plugins and device RTLs if they are available.
 add_subdirectory(plugins-nextgen)
-add_subdirectory(DeviceRTL)
 add_subdirectory(tools)
 
 # Build target agnostic offloading library.
diff --git a/offload/DeviceRTL/CMakeLists.txt b/offload/DeviceRTL/CMakeLists.txt
deleted file mode 100644
index 12f53a30761f3..0000000000000
--- a/offload/DeviceRTL/CMakeLists.txt
+++ /dev/null
@@ -1,181 +0,0 @@
-set(LIBOMPTARGET_BUILD_DEVICERTL_BCLIB TRUE CACHE BOOL
-  "Can be set to false to disable building this library.")
-
-if (NOT LIBOMPTARGET_BUILD_DEVICERTL_BCLIB)
-  message(STATUS "Not building DeviceRTL: Disabled by LIBOMPTARGET_BUILD_DEVICERTL_BCLIB")
-  return()
-endif()
-
-# Check to ensure the host system is a supported host architecture.
-if(NOT ${CMAKE_SIZEOF_VOID_P} EQUAL "8")
-  message(STATUS "Not building DeviceRTL: Runtime does not support 32-bit hosts")
-  return()
-endif()
-
-if (LLVM_DIR)
-  # Builds that use pre-installed LLVM have LLVM_DIR set.
-  # A standalone or LLVM_ENABLE_RUNTIMES=openmp build takes this route
-  find_program(CLANG_TOOL clang PATHS ${LLVM_TOOLS_BINARY_DIR} NO_DEFAULT_PATH)
-elseif (LLVM_TOOL_CLANG_BUILD AND NOT CMAKE_CROSSCOMPILING AND NOT OPENMP_STANDALONE_BUILD)
-  # LLVM in-tree builds may use CMake target names to discover the tools.
-  # A LLVM_ENABLE_PROJECTS=openmp build takes this route
-  set(CLANG_TOOL $<TARGET_FILE:clang>)
-else()
-  message(STATUS "Not building DeviceRTL. No appropriate clang found")
-  return()
-endif()
-
-set(devicertl_base_directory ${CMAKE_CURRENT_SOURCE_DIR})
-set(include_directory ${devicertl_base_directory}/include)
-set(source_directory ${devicertl_base_directory}/src)
-
-set(include_files
-  ${include_directory}/Allocator.h
-  ${include_directory}/Configuration.h
-  ${include_directory}/Debug.h
-  ${include_directory}/Interface.h
-  ${include_directory}/LibC.h
-  ${include_directory}/Mapping.h
-  ${include_directory}/Profiling.h
-  ${include_directory}/State.h
-  ${include_directory}/Synchronization.h
-  ${include_directory}/DeviceTypes.h
-  ${include_directory}/DeviceUtils.h
-  ${include_directory}/Workshare.h
-)
-
-set(src_files
-  ${source_directory}/Allocator.cpp
-  ${source_directory}/Configuration.cpp
-  ${source_directory}/Debug.cpp
-  ${source_directory}/Kernel.cpp
-  ${source_directory}/LibC.cpp
-  ${source_directory}/Mapping.cpp
-  ${source_directory}/Misc.cpp
-  ${source_directory}/Parallelism.cpp
-  ${source_directory}/Profiling.cpp
-  ${source_directory}/Reduction.cpp
-  ${source_directory}/State.cpp
-  ${source_directory}/Synchronization.cpp
-  ${source_directory}/Tasking.cpp
-  ${source_directory}/DeviceUtils.cpp
-  ${source_directory}/Workshare.cpp
-)
-
-# We disable the slp vectorizer during the runtime optimization to avoid
-# vectorized accesses to the shared state. Generally, those are "good" but
-# the optimizer pipeline (esp. Attributor) does not fully support vectorized
-# instructions yet and we end up missing out on way more important constant
-# propagation. That said, we will run the vectorizer again after the runtime
-# has been linked into the user program.
-set(clang_opt_flags -O3 -mllvm -openmp-opt-disable -DSHARED_SCRATCHPAD_SIZE=512 -mllvm -vectorize-slp=false )
-
-# If the user built with the GPU C library enabled we will use that instead.
-if(${LIBOMPTARGET_GPU_LIBC_SUPPORT})
-  list(APPEND clang_opt_flags -DOMPTARGET_HAS_LIBC)
-endif()
-
-# Set flags for LLVM Bitcode compilation.
-set(bc_flags -c -flto -std=c++17 -fvisibility=hidden
-             ${clang_opt_flags} -nogpulib -nostdlibinc
-             -fno-rtti -fno-exceptions -fconvergent-functions
-             -Wno-unknown-cuda-version
-             -DOMPTARGET_DEVICE_RUNTIME
-             -I${include_directory}
-             -I${devicertl_base_directory}/../include
-             -I${devicertl_base_directory}/../../libc
-)
-
-# first create an object target
-function(compileDeviceRTLLibrary target_name target_triple)
-  set(target_bc_flags ${ARGN})
-
-  foreach(src ${src_files})
-    get_filename_component(infile ${src} ABSOLUTE)
-    get_filename_component(outfile ${src} NAME)
-    set(outfile "${outfile}-${target_name}.o")
-    set(depfile "${outfile}.d")
-
-    # Passing an empty CPU to -march= suppressed target specific metadata.
-    add_custom_command(OUTPUT ${outfile}
-      COMMAND ${CLANG_TOOL}
-      ${bc_flags}
-      --target=${target_triple}
-      ${target_bc_flags}
-      -MD -MF ${depfile}
-      ${infile} -o ${outfile}
-      DEPENDS ${infile}
-      DEPFILE ${depfile}
-      COMMENT "Building LLVM bitcode ${outfile}"
-      VERBATIM
-    )
-    if(TARGET clang)
-      # Add a file-level dependency to ensure that clang is up-to-date.
-      # By default, add_custom_command only builds clang if the
-      # executable is missing.
-      add_custom_command(OUTPUT ${outfile}
-        DEPENDS clang
-        APPEND
-      )
-    endif()
-    set_property(DIRECTORY APPEND PROPERTY ADDITIONAL_MAKE_CLEAN_FILES ${outfile})
-
-    list(APPEND obj_files ${CMAKE_CURRENT_BINARY_DIR}/${outfile})
-  endforeach()
-  # Trick to combine these into a bitcode file via the linker's LTO pass. This
-  # is used to provide the legacy `libomptarget-<name>.bc` files. Hack this
-  # through as an executable to get it to use the relocatable link.
-  add_executable(libomptarget-${target_name} ${obj_files})
-  set_target_properties(libomptarget-${target_name} PROPERTIES
-    RUNTIME_OUTPUT_DIRECTORY ${LIBOMPTARGET_LLVM_LIBRARY_INTDIR}
-    LINKER_LANGUAGE CXX
-    BUILD_RPATH ""
-    INSTALL_RPATH ""
-    RUNTIME_OUTPUT_NAME libomptarget-${target_name}.bc)
-  target_compile_options(libomptarget-${target_name} PRIVATE "--target=${target_triple}" "-march=")
-  target_link_options(libomptarget-${target_name} PRIVATE "--target=${target_triple}"
-                      "-r" "-nostdlib" "-flto" "-Wl,--lto-emit-llvm" "-march=")
-  install(TARGETS libomptarget-${target_name}
-          PERMISSIONS OWNER_WRITE OWNER_READ GROUP_READ WORLD_READ
-          DESTINATION ${OFFLOAD_INSTALL_LIBDIR})
-
-  add_library(omptarget.${target_name}.all_objs OBJECT IMPORTED)
-  set_property(TARGET omptarget.${target_name}.all_objs APPEND PROPERTY IMPORTED_OBJECTS
-               ${LIBOMPTARGET_LLVM_LIBRARY_INTDIR}/libomptarget-${target_name}.bc)
-
-  # Archive all the object files generated above into a static library
-  add_library(omptarget.${target_name} STATIC)
-  set_target_properties(omptarget.${target_name} PROPERTIES
-    ARCHIVE_OUTPUT_DIRECTORY "${LIBOMPTARGET_LLVM_LIBRARY_INTDIR}/${target_triple}"
-    ARCHIVE_OUTPUT_NAME ompdevice
-    LINKER_LANGUAGE CXX
-  )
-  target_link_libraries(omptarget.${target_name} PRIVATE omptarget.${target_name}.all_objs)
-
-  install(TARGETS omptarget.${target_name}
-          ARCHIVE DESTINATION "lib${LLVM_LIBDIR_SUFFIX}/${target_triple}")
-
-  if (CMAKE_EXPORT_COMPILE_COMMANDS)
-    set(ide_target_name omptarget-ide-${target_name})
-    add_library(${ide_target_name} STATIC EXCLUDE_FROM_ALL ${src_files})
-    target_compile_options(${ide_target_name} PRIVATE
-      -fvisibility=hidden --target=${target_triple}
-      -nogpulib -nostdlibinc -Wno-unknown-cuda-version
-    )
-    target_compile_definitions(${ide_target_name} PRIVATE SHARED_SCRATCHPAD_SIZE=512)
-    target_include_directories(${ide_target_name} PRIVATE
-      ${include_directory}
-      ${devicertl_base_directory}/../../libc
-      ${devicertl_base_directory}/../include
-    )
-    install(TARGETS ${ide_target_name} EXCLUDE_FROM_ALL)
-  endif()
-endfunction()
-
-if(NOT LLVM_TARGETS_TO_BUILD OR "AMDGPU" IN_LIST LLVM_TARGETS_TO_BUILD)
-  compileDeviceRTLLibrary(amdgpu amdgcn-amd-amdhsa -Xclang -mcode-object-version=none)
-endif()
-
-if(NOT LLVM_TARGETS_TO_BUILD OR "NVPTX" IN_LIST LLVM_TARGETS_TO_BUILD)
-  compileDeviceRTLLibrary(nvptx nvptx64-nvidia-cuda --cuda-feature=+ptx63)
-endif()
diff --git a/offload/cmake/caches/Offload.cmake b/offload/cmake/caches/Offload.cmake
index 5533a6508f5d5..3747a1d3eb299 100644
--- a/offload/cmake/caches/Offload.cmake
+++ b/offload/cmake/caches/Offload.cmake
@@ -5,5 +5,5 @@ set(LLVM_ENABLE_PER_TARGET_RUNTIME_DIR ON CACHE BOOL "")
 set(LLVM_RUNTIME_TARGETS default;amdgcn-amd-amdhsa;nvptx64-nvidia-cuda CACHE STRING "") 
 set(RUNTIMES_nvptx64-nvidia-cuda_CACHE_FILES "${CMAKE_SOURCE_DIR}/../libcxx/cmake/caches/NVPTX.cmake" CACHE STRING "")
 set(RUNTIMES_amdgcn-amd-amdhsa_CACHE_FILES "${CMAKE_SOURCE_DIR}/../libcxx/cmake/caches/AMDGPU.cmake" CACHE STRING "")
-set(RUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES "compiler-rt;libc;libcxx;libcxxabi" CACHE STRING "")
-set(RUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES "compiler-rt;libc;libcxx;libcxxabi" CACHE STRING "")
+set(RUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES "compiler-rt;libc;openmp;libcxx;libcxxabi" CACHE STRING "")
+set(RUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES "compiler-rt;libc;openmp;libcxx;libcxxabi" CACHE STRING "")
diff --git a/openmp/CMakeLists.txt b/openmp/CMakeLists.txt
index c206386fa6b61..c1c533d00f8bb 100644
--- a/openmp/CMakeLists.txt
+++ b/openmp/CMakeLists.txt
@@ -88,6 +88,14 @@ else()
   set(CMAKE_CXX_EXTENSIONS NO)
 endif()
 
+# Targeting the GPU directly requires a few flags to make CMake happy.
+if("${CMAKE_CXX_COMPILER_TARGET}" MATCHES "^amdgcn")
+  set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} -nogpulib")
+elseif("${CMAKE_CXX_COMPILER_TARGET}" MATCHES "^nvptx")
+  set(CMAKE_REQUIRED_FLAGS
+      "${CMAKE_REQUIRED_FLAGS} -flto -c -Wno-unused-command-line-argument")
+endif()
+
 # Check and set up common compiler flags.
 include(config-ix)
 include(HandleOpenMPOptions)
@@ -122,35 +130,41 @@ else()
   get_clang_resource_dir(LIBOMP_HEADERS_INSTALL_PATH SUBDIR include)
 endif()
 
-# Build host runtime library, after LIBOMPTARGET variables are set since they are needed
-# to enable time profiling support in the OpenMP runtime.
-add_subdirectory(runtime)
-
-set(ENABLE_OMPT_TOOLS ON)
-# Currently tools are not tested well on Windows or MacOS X.
-if (APPLE OR WIN32)
-  set(ENABLE_OMPT_TOOLS OFF)
-endif()
-
-option(OPENMP_ENABLE_OMPT_TOOLS "Enable building ompt based tools for OpenMP."
-       ${ENABLE_OMPT_TOOLS})
-if (OPENMP_ENABLE_OMPT_TOOLS)
-  add_subdirectory(tools)
-endif()
-
-# Propagate OMPT support to offload
-if(NOT ${OPENMP_STANDALONE_BUILD})
-  set(LIBOMP_HAVE_OMPT_SUPPORT ${LIBOMP_HAVE_OMPT_SUPPORT} PARENT_SCOPE)
-  set(LIBOMP_OMP_TOOLS_INCLUDE_DIR ${LIBOMP_OMP_TOOLS_INCLUDE_DIR} PARENT_SCOPE)
+# Use the current compiler target to determine the appropriate runtime to build.
+if("${LLVM_DEFAULT_TARGET_TRIPLE}" MATCHES "^amdgcn|^nvptx" OR
+   "${CMAKE_CXX_COMPILER_TARGET}" MATCHES "^amdgcn|^nvptx")
+  add_subdirectory(device)
+else()
+  # Build host runtime library, after LIBOMPTARGET variables are set since they
+  # are needed to enable time profiling support in the OpenMP runtime.
+  add_subdirectory(runtime)
+  
+  set(ENABLE_OMPT_TOOLS ON)
+  # Currently tools are not tested well on Windows or MacOS X.
+  if (APPLE OR WIN32)
+    set(ENABLE_OMPT_TOOLS OFF)
+  endif()
+  
+  option(OPENMP_ENABLE_OMPT_TOOLS "Enable building ompt based tools for OpenMP."
+         ${ENABLE_OMPT_TOOLS})
+  if (OPENMP_ENABLE_OMPT_TOOLS)
+    add_subdirectory(tools)
+  endif()
+  
+  # Propagate OMPT support to offload
+  if(NOT ${OPENMP_STANDALONE_BUILD})
+    set(LIBOMP_HAVE_OMPT_SUPPORT ${LIBOMP_HAVE_OMPT_SUPPORT} PARENT_SCOPE)
+    set(LIBOMP_OMP_TOOLS_INCLUDE_DIR ${LIBOMP_OMP_TOOLS_INCLUDE_DIR} PARENT_SCOPE)
+  endif()
+  
+  option(OPENMP_MSVC_NAME_SCHEME "Build dll with MSVC naming scheme." OFF)
+  
+  # Build libompd.so
+  add_subdirectory(libompd)
+  
+  # Build documentation
+  add_subdirectory(docs)
+  
+  # Now that we have seen all testsuites, create the check-openmp target.
+  construct_check_openmp_target()
 endif()
-
-option(OPENMP_MSVC_NAME_SCHEME "Build dll with MSVC naming scheme." OFF)
-
-# Build libompd.so
-add_subdirectory(libompd)
-
-# Build documentation
-add_subdirectory(docs)
-
-# Now that we have seen all testsuites, create the check-openmp target.
-construct_check_openmp_target()
diff --git a/openmp/device/CMakeLists.txt b/openmp/device/CMakeLists.txt
new file mode 100644
index 0000000000000..9211186f4012a
--- /dev/null
+++ b/openmp/device/CMakeLists.txt
@@ -0,0 +1,99 @@
+# Ensure the compiler is a valid clang when building the GPU target.
+set(req_ver "${LLVM_VERSION_MAJOR}.${LLVM_VERSION_MINOR}.${LLVM_VERSION_PATCH}")
+if(LLVM_VERSION_MAJOR AND NOT (CMAKE_CXX_COMPILER_ID MATCHES "[Cc]lang" AND
+   ${CMAKE_CXX_COMPILER_VERSION} VERSION_EQUAL "${req_ver}"))
+  message(FATAL_ERROR "Cannot build GPU device runtime. CMake compiler "
+                      "'${CMAKE_CXX_COMPILER_ID} ${CMAKE_CXX_COMPILER_VERSION}' "
+                      " is not 'Clang ${req_ver}'.")
+endif()
+
+set(src_files
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Allocator.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Configuration.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Debug.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Kernel.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/LibC.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Mapping.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Misc.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Parallelism.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Profiling.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Reduction.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/State.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Synchronization.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Tasking.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/DeviceUtils.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Workshare.cpp
+)
+
+list(APPEND compile_options -flto)
+list(APPEND compile_options -fvisibility=hidden)
+list(APPEND compile_options -nogpulib)
+list(APPEND compile_options -nostdlibinc)
+list(APPEND compile_options -fno-rtti)
+list(APPEND compile_options -fno-exceptions)
+list(APPEND compile_options -fconvergent-functions)
+list(APPEND compile_options -Wno-unknown-cuda-version)
+if(LLVM_DEFAULT_TARGET_TRIPLE)
+  list(APPEND compile_options --target=${LLVM_DEFAULT_TARGET_TRIPLE})
+endif()
+
+# We disable the slp vectorizer during the runtime optimization to avoid
+# vectorized accesses to the shared state. Generally, those are "good" but
+# the optimizer pipeline (esp. Attributor) does not fully support vectorized
+# instructions yet and we end up missing out on way more important constant
+# propagation. That said, we will run the vectorizer again after the runtime
+# has been linked into the user program.
+list(APPEND compile_flags "SHELL: -mllvm -vectorize-slp=false")
+if("${LLVM_DEFAULT_TARGET_TRIPLE}" MATCHES "^amdgcn" OR
+   "${CMAKE_CXX_COMPILER_TARGET}" MATCHES "^amdgcn")
+  set(target_name "amdgpu")
+  list(APPEND compile_flags "SHELL:-Xclang -mcode-object-version=none")
+elseif("${LLVM_DEFAULT_TARGET_TRIPLE}" MATCHES "^nvptx" OR
+       "${CMAKE_CXX_COMPILER_TARGET}" MATCHES "^nvptx")
+  set(target_name "nvptx")
+  list(APPEND compile_flags --cuda-feature=+ptx63)
+endif()
+
+# Trick to combine these into a bitcode file via the linker's LTO pass.
+add_executable(libompdevice ${src_files})
+set_target_properties(libompdevice PROPERTIES
+  RUNTIME_OUTPUT_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}
+  LINKER_LANGUAGE CXX
+  BUILD_RPATH ""
+  INSTALL_RPATH ""
+  RUNTIME_OUTPUT_NAME libomptarget-${target_name}.bc)
+
+# If the user built with the GPU C library enabled we will use that instead.
+if(LIBOMPTARGET_GPU_LIBC_SUPPORT)
+  target_compile_definitions(libompdevice PRIVATE OMPTARGET_HAS_LIBC)
+endif()
+target_compile_definitions(libompdevice PRIVATE SHARED_SCRATCHPAD_SIZE=512)
+
+target_include_directories(libompdevice PRIVATE 
+                           ${CMAKE_CURRENT_SOURCE_DIR}/include
+                           ${CMAKE_CURRENT_SOURCE_DIR}/../../libc
+                           ${CMAKE_CURRENT_SOURCE_DIR}/../../offload/include)
+target_compile_options(libompdevice PRIVATE ${compile_options})
+target_link_options(libompdevice PRIVATE
+                    "-flto" "-r" "-nostdlib" "-Wl,--lto-emit-llvm")
+if(LLVM_DEFAULT_TARGET_TRIPLE)
+  target_link_options(libompdevice PRIVATE "--target=${LLVM_DEFAULT_TARGET_TRIPLE}")
+endif()
+install(TARGETS libompdevice
+        PERMISSIONS OWNER_WRITE OWNER_READ GROUP_READ WORLD_READ
+        DESTINATION ${OPENMP_INSTALL_LIBDIR})
+
+add_library(ompdevice.all_objs OBJECT IMPORTED)
+set_property(TARGET ompdevice.all_objs APPEND PROPERTY IMPORTED_OBJECTS
+             ${CMAKE_CURRENT_BINARY_DIR}/libomptarget-${target_name}.bc)
+
+# Archive all the object files generated above into a static library
+add_library(ompdevice STATIC)
+add_dependencies(ompdevice libompdevice)
+set_target_properties(ompdevice PROPERTIES
+  ARCHIVE_OUTPUT_DIRECTORY "${OPENMP_INSTALL_LIBDIR}"
+  ARCHIVE_OUTPUT_NAME ompdevice
+  LINKER_LANGUAGE CXX
+)
+target_link_libraries(ompdevice PRIVATE ompdevice.all_objs)
+install(TARGETS ompdevice ARCHIVE DESTINATION "${OPENMP_INSTALL_LIBDIR}")
diff --git a/offload/DeviceRTL/include/Allocator.h b/openmp/device/include/Allocator.h
similarity index 100%
rename from offload/DeviceRTL/include/Allocator.h
rename to openmp/device/include/Allocator.h
diff --git a/offload/DeviceRTL/include/Configuration.h b/openmp/device/include/Configuration.h
similarity index 100%
rename from offload/DeviceRTL/include/Configuration.h
rename to openmp/device/include/Configuration.h
diff --git a/offload/DeviceRTL/include/Debug.h b/openmp/device/include/Debug.h
similarity index 100%
rename from offload/DeviceRTL/include/Debug.h
rename to openmp/device/include/Debug.h
diff --git a/offload/DeviceRTL/include/DeviceTypes.h b/openmp/device/include/DeviceTypes.h
similarity index 100%
rename from offload/DeviceRTL/include/DeviceTypes.h
rename to openmp/device/include/DeviceTypes.h
diff --git a/offload/DeviceRTL/include/DeviceUtils.h b/openmp/device/include/DeviceUtils.h
similarity index 100%
rename from offload/DeviceRTL/include/DeviceUtils.h
rename to openmp/device/include/DeviceUtils.h
diff --git a/offload/DeviceRTL/include/Interface.h b/openmp/device/include/Interface.h
similarity index 100%
rename from offload/DeviceRTL/include/Interface.h
rename to openmp/device/include/Interface.h
diff --git a/offload/DeviceRTL/include/LibC.h b/openmp/device/include/LibC.h
similarity index 100%
rename from offload/DeviceRTL/include/LibC.h
rename to openmp/device/include/LibC.h
diff --git a/offload/DeviceRTL/include/Mapping.h b/openmp/device/include/Mapping.h
similarity index 100%
rename from offload/DeviceRTL/include/Mapping.h
rename to openmp/device/include/Mapping.h
diff --git a/offload/DeviceRTL/include/Profiling.h b/openmp/device/include/Profiling.h
similarity index 100%
rename from offload/DeviceRTL/include/Profiling.h
rename to openmp/device/include/Profiling.h
diff --git a/offload/DeviceRTL/include/State.h b/openmp/device/include/State.h
similarity index 100%
rename from offload/Dev...
[truncated]

llvmbot (Member) commented Apr 22, 2025

@llvm/pr-subscribers-clang-driver

-# vectorized accesses to the shared state. Generally, those are "good" but
-# the optimizer pipeline (esp. Attributor) does not fully support vectorized
-# instructions yet and we end up missing out on way more important constant
-# propagation. That said, we will run the vectorizer again after the runtime
-# has been linked into the user program.
-set(clang_opt_flags -O3 -mllvm -openmp-opt-disable -DSHARED_SCRATCHPAD_SIZE=512 -mllvm -vectorize-slp=false )
-
-# If the user built with the GPU C library enabled we will use that instead.
-if(${LIBOMPTARGET_GPU_LIBC_SUPPORT})
-  list(APPEND clang_opt_flags -DOMPTARGET_HAS_LIBC)
-endif()
-
-# Set flags for LLVM Bitcode compilation.
-set(bc_flags -c -flto -std=c++17 -fvisibility=hidden
-             ${clang_opt_flags} -nogpulib -nostdlibinc
-             -fno-rtti -fno-exceptions -fconvergent-functions
-             -Wno-unknown-cuda-version
-             -DOMPTARGET_DEVICE_RUNTIME
-             -I${include_directory}
-             -I${devicertl_base_directory}/../include
-             -I${devicertl_base_directory}/../../libc
-)
-
-# first create an object target
-function(compileDeviceRTLLibrary target_name target_triple)
-  set(target_bc_flags ${ARGN})
-
-  foreach(src ${src_files})
-    get_filename_component(infile ${src} ABSOLUTE)
-    get_filename_component(outfile ${src} NAME)
-    set(outfile "${outfile}-${target_name}.o")
-    set(depfile "${outfile}.d")
-
-    # Passing an empty CPU to -march= suppressed target specific metadata.
-    add_custom_command(OUTPUT ${outfile}
-      COMMAND ${CLANG_TOOL}
-      ${bc_flags}
-      --target=${target_triple}
-      ${target_bc_flags}
-      -MD -MF ${depfile}
-      ${infile} -o ${outfile}
-      DEPENDS ${infile}
-      DEPFILE ${depfile}
-      COMMENT "Building LLVM bitcode ${outfile}"
-      VERBATIM
-    )
-    if(TARGET clang)
-      # Add a file-level dependency to ensure that clang is up-to-date.
-      # By default, add_custom_command only builds clang if the
-      # executable is missing.
-      add_custom_command(OUTPUT ${outfile}
-        DEPENDS clang
-        APPEND
-      )
-    endif()
-    set_property(DIRECTORY APPEND PROPERTY ADDITIONAL_MAKE_CLEAN_FILES ${outfile})
-
-    list(APPEND obj_files ${CMAKE_CURRENT_BINARY_DIR}/${outfile})
-  endforeach()
-  # Trick to combine these into a bitcode file via the linker's LTO pass. This
-  # is used to provide the legacy `libomptarget-<name>.bc` files. Hack this
-  # through as an executable to get it to use the relocatable link.
-  add_executable(libomptarget-${target_name} ${obj_files})
-  set_target_properties(libomptarget-${target_name} PROPERTIES
-    RUNTIME_OUTPUT_DIRECTORY ${LIBOMPTARGET_LLVM_LIBRARY_INTDIR}
-    LINKER_LANGUAGE CXX
-    BUILD_RPATH ""
-    INSTALL_RPATH ""
-    RUNTIME_OUTPUT_NAME libomptarget-${target_name}.bc)
-  target_compile_options(libomptarget-${target_name} PRIVATE "--target=${target_triple}" "-march=")
-  target_link_options(libomptarget-${target_name} PRIVATE "--target=${target_triple}"
-                      "-r" "-nostdlib" "-flto" "-Wl,--lto-emit-llvm" "-march=")
-  install(TARGETS libomptarget-${target_name}
-          PERMISSIONS OWNER_WRITE OWNER_READ GROUP_READ WORLD_READ
-          DESTINATION ${OFFLOAD_INSTALL_LIBDIR})
-
-  add_library(omptarget.${target_name}.all_objs OBJECT IMPORTED)
-  set_property(TARGET omptarget.${target_name}.all_objs APPEND PROPERTY IMPORTED_OBJECTS
-               ${LIBOMPTARGET_LLVM_LIBRARY_INTDIR}/libomptarget-${target_name}.bc)
-
-  # Archive all the object files generated above into a static library
-  add_library(omptarget.${target_name} STATIC)
-  set_target_properties(omptarget.${target_name} PROPERTIES
-    ARCHIVE_OUTPUT_DIRECTORY "${LIBOMPTARGET_LLVM_LIBRARY_INTDIR}/${target_triple}"
-    ARCHIVE_OUTPUT_NAME ompdevice
-    LINKER_LANGUAGE CXX
-  )
-  target_link_libraries(omptarget.${target_name} PRIVATE omptarget.${target_name}.all_objs)
-
-  install(TARGETS omptarget.${target_name}
-          ARCHIVE DESTINATION "lib${LLVM_LIBDIR_SUFFIX}/${target_triple}")
-
-  if (CMAKE_EXPORT_COMPILE_COMMANDS)
-    set(ide_target_name omptarget-ide-${target_name})
-    add_library(${ide_target_name} STATIC EXCLUDE_FROM_ALL ${src_files})
-    target_compile_options(${ide_target_name} PRIVATE
-      -fvisibility=hidden --target=${target_triple}
-      -nogpulib -nostdlibinc -Wno-unknown-cuda-version
-    )
-    target_compile_definitions(${ide_target_name} PRIVATE SHARED_SCRATCHPAD_SIZE=512)
-    target_include_directories(${ide_target_name} PRIVATE
-      ${include_directory}
-      ${devicertl_base_directory}/../../libc
-      ${devicertl_base_directory}/../include
-    )
-    install(TARGETS ${ide_target_name} EXCLUDE_FROM_ALL)
-  endif()
-endfunction()
-
-if(NOT LLVM_TARGETS_TO_BUILD OR "AMDGPU" IN_LIST LLVM_TARGETS_TO_BUILD)
-  compileDeviceRTLLibrary(amdgpu amdgcn-amd-amdhsa -Xclang -mcode-object-version=none)
-endif()
-
-if(NOT LLVM_TARGETS_TO_BUILD OR "NVPTX" IN_LIST LLVM_TARGETS_TO_BUILD)
-  compileDeviceRTLLibrary(nvptx nvptx64-nvidia-cuda --cuda-feature=+ptx63)
-endif()
diff --git a/offload/cmake/caches/Offload.cmake b/offload/cmake/caches/Offload.cmake
index 5533a6508f5d5..3747a1d3eb299 100644
--- a/offload/cmake/caches/Offload.cmake
+++ b/offload/cmake/caches/Offload.cmake
@@ -5,5 +5,5 @@ set(LLVM_ENABLE_PER_TARGET_RUNTIME_DIR ON CACHE BOOL "")
 set(LLVM_RUNTIME_TARGETS default;amdgcn-amd-amdhsa;nvptx64-nvidia-cuda CACHE STRING "") 
 set(RUNTIMES_nvptx64-nvidia-cuda_CACHE_FILES "${CMAKE_SOURCE_DIR}/../libcxx/cmake/caches/NVPTX.cmake" CACHE STRING "")
 set(RUNTIMES_amdgcn-amd-amdhsa_CACHE_FILES "${CMAKE_SOURCE_DIR}/../libcxx/cmake/caches/AMDGPU.cmake" CACHE STRING "")
-set(RUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES "compiler-rt;libc;libcxx;libcxxabi" CACHE STRING "")
-set(RUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES "compiler-rt;libc;libcxx;libcxxabi" CACHE STRING "")
+set(RUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES "compiler-rt;libc;openmp;libcxx;libcxxabi" CACHE STRING "")
+set(RUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES "compiler-rt;libc;openmp;libcxx;libcxxabi" CACHE STRING "")
diff --git a/openmp/CMakeLists.txt b/openmp/CMakeLists.txt
index c206386fa6b61..c1c533d00f8bb 100644
--- a/openmp/CMakeLists.txt
+++ b/openmp/CMakeLists.txt
@@ -88,6 +88,14 @@ else()
   set(CMAKE_CXX_EXTENSIONS NO)
 endif()
 
+# Targeting the GPU directly requires a few flags to make CMake happy.
+if("${CMAKE_CXX_COMPILER_TARGET}" MATCHES "^amdgcn")
+  set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} -nogpulib")
+elseif("${CMAKE_CXX_COMPILER_TARGET}" MATCHES "^nvptx")
+  set(CMAKE_REQUIRED_FLAGS
+      "${CMAKE_REQUIRED_FLAGS} -flto -c -Wno-unused-command-line-argument")
+endif()
+
 # Check and set up common compiler flags.
 include(config-ix)
 include(HandleOpenMPOptions)
@@ -122,35 +130,41 @@ else()
   get_clang_resource_dir(LIBOMP_HEADERS_INSTALL_PATH SUBDIR include)
 endif()
 
-# Build host runtime library, after LIBOMPTARGET variables are set since they are needed
-# to enable time profiling support in the OpenMP runtime.
-add_subdirectory(runtime)
-
-set(ENABLE_OMPT_TOOLS ON)
-# Currently tools are not tested well on Windows or MacOS X.
-if (APPLE OR WIN32)
-  set(ENABLE_OMPT_TOOLS OFF)
-endif()
-
-option(OPENMP_ENABLE_OMPT_TOOLS "Enable building ompt based tools for OpenMP."
-       ${ENABLE_OMPT_TOOLS})
-if (OPENMP_ENABLE_OMPT_TOOLS)
-  add_subdirectory(tools)
-endif()
-
-# Propagate OMPT support to offload
-if(NOT ${OPENMP_STANDALONE_BUILD})
-  set(LIBOMP_HAVE_OMPT_SUPPORT ${LIBOMP_HAVE_OMPT_SUPPORT} PARENT_SCOPE)
-  set(LIBOMP_OMP_TOOLS_INCLUDE_DIR ${LIBOMP_OMP_TOOLS_INCLUDE_DIR} PARENT_SCOPE)
+# Use the current compiler target to determine the appropriate runtime to build.
+if("${LLVM_DEFAULT_TARGET_TRIPLE}" MATCHES "^amdgcn|^nvptx" OR
+   "${CMAKE_CXX_COMPILER_TARGET}" MATCHES "^amdgcn|^nvptx")
+  add_subdirectory(device)
+else()
+  # Build host runtime library, after LIBOMPTARGET variables are set since they
+  # are needed to enable time profiling support in the OpenMP runtime.
+  add_subdirectory(runtime)
+  
+  set(ENABLE_OMPT_TOOLS ON)
+  # Currently tools are not tested well on Windows or MacOS X.
+  if (APPLE OR WIN32)
+    set(ENABLE_OMPT_TOOLS OFF)
+  endif()
+  
+  option(OPENMP_ENABLE_OMPT_TOOLS "Enable building ompt based tools for OpenMP."
+         ${ENABLE_OMPT_TOOLS})
+  if (OPENMP_ENABLE_OMPT_TOOLS)
+    add_subdirectory(tools)
+  endif()
+  
+  # Propagate OMPT support to offload
+  if(NOT ${OPENMP_STANDALONE_BUILD})
+    set(LIBOMP_HAVE_OMPT_SUPPORT ${LIBOMP_HAVE_OMPT_SUPPORT} PARENT_SCOPE)
+    set(LIBOMP_OMP_TOOLS_INCLUDE_DIR ${LIBOMP_OMP_TOOLS_INCLUDE_DIR} PARENT_SCOPE)
+  endif()
+  
+  option(OPENMP_MSVC_NAME_SCHEME "Build dll with MSVC naming scheme." OFF)
+  
+  # Build libompd.so
+  add_subdirectory(libompd)
+  
+  # Build documentation
+  add_subdirectory(docs)
+  
+  # Now that we have seen all testsuites, create the check-openmp target.
+  construct_check_openmp_target()
 endif()
-
-option(OPENMP_MSVC_NAME_SCHEME "Build dll with MSVC naming scheme." OFF)
-
-# Build libompd.so
-add_subdirectory(libompd)
-
-# Build documentation
-add_subdirectory(docs)
-
-# Now that we have seen all testsuites, create the check-openmp target.
-construct_check_openmp_target()
diff --git a/openmp/device/CMakeLists.txt b/openmp/device/CMakeLists.txt
new file mode 100644
index 0000000000000..9211186f4012a
--- /dev/null
+++ b/openmp/device/CMakeLists.txt
@@ -0,0 +1,99 @@
+# Ensure the compiler is a valid clang when building the GPU target.
+set(req_ver "${LLVM_VERSION_MAJOR}.${LLVM_VERSION_MINOR}.${LLVM_VERSION_PATCH}")
+if(LLVM_VERSION_MAJOR AND NOT (CMAKE_CXX_COMPILER_ID MATCHES "[Cc]lang" AND
+   ${CMAKE_CXX_COMPILER_VERSION} VERSION_EQUAL "${req_ver}"))
+  message(FATAL_ERROR "Cannot build GPU device runtime. CMake compiler "
+                      "'${CMAKE_CXX_COMPILER_ID} ${CMAKE_CXX_COMPILER_VERSION}' "
+                      "is not 'Clang ${req_ver}'.")
+endif()
+
+set(src_files
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Allocator.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Configuration.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Debug.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Kernel.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/LibC.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Mapping.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Misc.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Parallelism.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Profiling.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Reduction.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/State.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Synchronization.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Tasking.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/DeviceUtils.cpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/src/Workshare.cpp
+)
+
+list(APPEND compile_options -flto)
+list(APPEND compile_options -fvisibility=hidden)
+list(APPEND compile_options -nogpulib)
+list(APPEND compile_options -nostdlibinc)
+list(APPEND compile_options -fno-rtti)
+list(APPEND compile_options -fno-exceptions)
+list(APPEND compile_options -fconvergent-functions)
+list(APPEND compile_options -Wno-unknown-cuda-version)
+if(LLVM_DEFAULT_TARGET_TRIPLE)
+  list(APPEND compile_options --target=${LLVM_DEFAULT_TARGET_TRIPLE})
+endif()
+
+# We disable the slp vectorizer during the runtime optimization to avoid
+# vectorized accesses to the shared state. Generally, those are "good" but
+# the optimizer pipeline (esp. Attributor) does not fully support vectorized
+# instructions yet and we end up missing out on way more important constant
+# propagation. That said, we will run the vectorizer again after the runtime
+# has been linked into the user program.
+list(APPEND compile_options "SHELL:-mllvm -vectorize-slp=false")
+if("${LLVM_DEFAULT_TARGET_TRIPLE}" MATCHES "^amdgcn" OR
+   "${CMAKE_CXX_COMPILER_TARGET}" MATCHES "^amdgcn")
+  set(target_name "amdgpu")
+  list(APPEND compile_options "SHELL:-Xclang -mcode-object-version=none")
+elseif("${LLVM_DEFAULT_TARGET_TRIPLE}" MATCHES "^nvptx" OR
+       "${CMAKE_CXX_COMPILER_TARGET}" MATCHES "^nvptx")
+  set(target_name "nvptx")
+  list(APPEND compile_options --cuda-feature=+ptx63)
+endif()
+
+# Trick to combine these into a bitcode file via the linker's LTO pass.
+add_executable(libompdevice ${src_files})
+set_target_properties(libompdevice PROPERTIES
+  RUNTIME_OUTPUT_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}
+  LINKER_LANGUAGE CXX
+  BUILD_RPATH ""
+  INSTALL_RPATH ""
+  RUNTIME_OUTPUT_NAME libomptarget-${target_name}.bc)
+
+# If the user built with the GPU C library enabled we will use that instead.
+if(LIBOMPTARGET_GPU_LIBC_SUPPORT)
+  target_compile_definitions(libompdevice PRIVATE OMPTARGET_HAS_LIBC)
+endif()
+target_compile_definitions(libompdevice PRIVATE SHARED_SCRATCHPAD_SIZE=512)
+
+target_include_directories(libompdevice PRIVATE 
+                           ${CMAKE_CURRENT_SOURCE_DIR}/include
+                           ${CMAKE_CURRENT_SOURCE_DIR}/../../libc
+                           ${CMAKE_CURRENT_SOURCE_DIR}/../../offload/include)
+target_compile_options(libompdevice PRIVATE ${compile_options})
+target_link_options(libompdevice PRIVATE
+                    "-flto" "-r" "-nostdlib" "-Wl,--lto-emit-llvm")
+if(LLVM_DEFAULT_TARGET_TRIPLE)
+  target_link_options(libompdevice PRIVATE "--target=${LLVM_DEFAULT_TARGET_TRIPLE}")
+endif()
+install(TARGETS libompdevice
+        PERMISSIONS OWNER_WRITE OWNER_READ GROUP_READ WORLD_READ
+        DESTINATION ${OPENMP_INSTALL_LIBDIR})
+
+add_library(ompdevice.all_objs OBJECT IMPORTED)
+set_property(TARGET ompdevice.all_objs APPEND PROPERTY IMPORTED_OBJECTS
+             ${CMAKE_CURRENT_BINARY_DIR}/libomptarget-${target_name}.bc)
+
+# Archive all the object files generated above into a static library
+add_library(ompdevice STATIC)
+add_dependencies(ompdevice libompdevice)
+set_target_properties(ompdevice PROPERTIES
+  ARCHIVE_OUTPUT_DIRECTORY "${OPENMP_INSTALL_LIBDIR}"
+  ARCHIVE_OUTPUT_NAME ompdevice
+  LINKER_LANGUAGE CXX
+)
+target_link_libraries(ompdevice PRIVATE ompdevice.all_objs)
+install(TARGETS ompdevice ARCHIVE DESTINATION "${OPENMP_INSTALL_LIBDIR}")
diff --git a/offload/DeviceRTL/include/Allocator.h b/openmp/device/include/Allocator.h
similarity index 100%
rename from offload/DeviceRTL/include/Allocator.h
rename to openmp/device/include/Allocator.h
diff --git a/offload/DeviceRTL/include/Configuration.h b/openmp/device/include/Configuration.h
similarity index 100%
rename from offload/DeviceRTL/include/Configuration.h
rename to openmp/device/include/Configuration.h
diff --git a/offload/DeviceRTL/include/Debug.h b/openmp/device/include/Debug.h
similarity index 100%
rename from offload/DeviceRTL/include/Debug.h
rename to openmp/device/include/Debug.h
diff --git a/offload/DeviceRTL/include/DeviceTypes.h b/openmp/device/include/DeviceTypes.h
similarity index 100%
rename from offload/DeviceRTL/include/DeviceTypes.h
rename to openmp/device/include/DeviceTypes.h
diff --git a/offload/DeviceRTL/include/DeviceUtils.h b/openmp/device/include/DeviceUtils.h
similarity index 100%
rename from offload/DeviceRTL/include/DeviceUtils.h
rename to openmp/device/include/DeviceUtils.h
diff --git a/offload/DeviceRTL/include/Interface.h b/openmp/device/include/Interface.h
similarity index 100%
rename from offload/DeviceRTL/include/Interface.h
rename to openmp/device/include/Interface.h
diff --git a/offload/DeviceRTL/include/LibC.h b/openmp/device/include/LibC.h
similarity index 100%
rename from offload/DeviceRTL/include/LibC.h
rename to openmp/device/include/LibC.h
diff --git a/offload/DeviceRTL/include/Mapping.h b/openmp/device/include/Mapping.h
similarity index 100%
rename from offload/DeviceRTL/include/Mapping.h
rename to openmp/device/include/Mapping.h
diff --git a/offload/DeviceRTL/include/Profiling.h b/openmp/device/include/Profiling.h
similarity index 100%
rename from offload/DeviceRTL/include/Profiling.h
rename to openmp/device/include/Profiling.h
diff --git a/offload/DeviceRTL/include/State.h b/openmp/device/include/State.h
similarity index 100%
rename from offload/Dev...
[truncated]

@jhuber6 jhuber6 force-pushed the OpenMPGPURuntime branch 2 times, most recently from ee6ca95 to 748a7f7 Compare April 22, 2025 17:54
jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Apr 22, 2025
Summary:
This was accidentally kept in the old location when we moved to the
new `lib/<triple>/` location for the DeviceRTL. Move this to reduce the
delta with llvm#136729.
@jhuber6 jhuber6 requested a review from Meinersbur April 22, 2025 19:59
Member

@Meinersbur Meinersbur left a comment

I think using the LLVM_ENABLE_RUNTIMES mechanism is a great idea.
Regarding the move back to openmp/device, I don't really have an opinion. However, there are some arguments to make:

  1. The same arguments apply to libomptarget as well
  2. Definitions such as those in Interface.h are indeed OpenMP-only
  3. Some definitions could be useful for other languages as well, such as Synchronization.h. However, they are also in the ompx namespace

Comment on lines +134 to +136
if("${LLVM_DEFAULT_TARGET_TRIPLE}" MATCHES "^amdgcn|^nvptx" OR
"${CMAKE_CXX_COMPILER_TARGET}" MATCHES "^amdgcn|^nvptx")
add_subdirectory(device)
Member

[serious] What happens with host offloading? They also need device-like functions such as omp_get_device_num(). The device-side implementation and host-side implementation are different. This also matters when e.g. offloading to a remote cluster (non-GPU) node via MPI.

I don't think we should (or can) assume that the triple determines whether it is executing on the host or device.
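
To make the concern concrete, here is a minimal CMake script sketch (not part of the patch) of how the quoted condition classifies triples. The regex is copied from the condition above; `classify_triple` is a hypothetical helper introduced only for illustration.

```cmake
cmake_minimum_required(VERSION 3.20)

# Hypothetical helper: mirrors the MATCHES test quoted from openmp/CMakeLists.txt.
function(classify_triple triple out_var)
  if("${triple}" MATCHES "^amdgcn|^nvptx")
    # The patch would call add_subdirectory(device) for these triples.
    set(${out_var} "device" PARENT_SCOPE)
  else()
    # Everything else falls through to the host runtime branch.
    set(${out_var} "host" PARENT_SCOPE)
  endif()
endfunction()

foreach(triple amdgcn-amd-amdhsa nvptx64-nvidia-cuda x86_64-unknown-linux-gnu)
  classify_triple(${triple} kind)
  message(STATUS "${triple} -> ${kind}")
endforeach()
```

Run with `cmake -P`, the two GPU triples take the device branch and the x86_64 triple takes the host branch; the condition has no way to express a non-GPU offload target.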

Contributor Author

Host offloading uses `libomp.so`. The way I think about it is that this `ompdevice` is basically `libomp` for GPUs.

Member

The device-side omp_get_device_num() (defined in libomptarget.so, not libomp.so) only returns omp_get_initial_device(), which is wrong for any kind of offloading.

After trying out what actually happens, I found that it executes the Fortran wrapper (in libomp.so). It also incorrectly assumes it is always executing on the host. That looks like a bug.

@mgorny
Member

mgorny commented Apr 23, 2025

Honestly, I am thoroughly confused about all that openmp ↔ offload moving. But if these don't share much code with the current openmp, perhaps the cleanest approach would be to make it entirely separate?

jhuber6 added a commit that referenced this pull request Apr 23, 2025
Summary:
This was accidentally kept in the old location when we moved to the
new `lib/<triple>/` location for the DeviceRTL. Move this to reduce the
delta with #136729.
@jhuber6
Contributor Author

jhuber6 commented Apr 23, 2025

> I think using the LLVM_ENABLE_RUNTIMES mechanism is a great idea. Regarding the move back to openmp/device, I don't really have an opinion. However, there are some arguments to make:
>
> 1. The same arguments apply to `libomptarget` as well
>
> 2. Definitions such as those in `Interface.h` are indeed OpenMP-only
>
> 3. Some definitions could be useful for other languages as well, such as `Synchronization.h`. However, they are also in the `ompx` namespace

Yes, I strongly believe that libomptarget should eventually be moved back into openmp/. Long term I think offload/ should contain the generic 'plugins' that provide an API for offloading to various GPUs. libomptarget then becomes the OpenMP runtime using that interface. There are arguments that some things in the current runtime are generically useful, but my assertion is that these should just be put in a separate library in offload/ if that's the case. Combining everything into a single library is a holdover from before we had the appropriate infrastructure to easily create these; now it's trivial to just make a liboffload.a for the GPU.

> Honestly, I am thoroughly confused about all that openmp ↔ offload moving. But if these don't share much code with the current openmp, perhaps the cleanest approach would be to make it entirely separate?

Yeah, it's a little confusing because right now offload/ has a direct dependency on openmp so they're effectively the same project.

@jdoerfert
Member

jdoerfert commented Apr 23, 2025

To make one thing clear early on: Standalone, this only introduces cost. There is no tangible benefit from this PR, but a CMake change that will break people. If this is done after other reorganizations have happened, e.g., a generic device RTL is created, this might change, though I am not sure about tangible benefits then either.

Alternative Proposal:

offload/DeviceRTL/generic
offload/DeviceRTL/openmp
offload/DeviceRTL/openacc
offload/DeviceRTL/sycl
offload/DeviceRTL/cuda
...

Now DeviceRTL.openmp.a lives in .../openmp/ and we can guard building it with "OPENMP_IS_ENABLED".
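
As a rough illustration of that guard, a hypothetical `offload/DeviceRTL/CMakeLists.txt` for the proposed layout might look like the fragment below. This is a sketch only: the `generic` and `openmp` subdirectories do not exist in the tree today, and `OPENMP_IS_ENABLED` is the placeholder name from the comment above, not an existing CMake variable.

```cmake
# Hypothetical fragment for the proposed layout; not part of this PR.

# Code shared by every language's device runtime would always be built.
add_subdirectory(generic)

# The OpenMP device runtime is only built when OpenMP is part of the build.
# OPENMP_IS_ENABLED is a placeholder taken from the proposal.
if(OPENMP_IS_ENABLED)
  add_subdirectory(openmp)
endif()
```

Other languages (openacc, sycl, cuda) would presumably follow the same pattern, each behind its own guard.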

Background:
DeviceRTL.openmp.a is dependent on offload and openmp.
It provides no functionality without both enabled (see the theoretical use case below).
Moving the code to openmp has no direct impact on anyone, except the cmake change.
W/ and w/o the PR we can use a CMake conditional to only build DeviceRTL.openmp.a if openmp is enabled, thus, we can "build what we need" either way.

Upsides of this PR (as I remember them):

  • More OpenMP-dependent code lives inside of openmp.
  • The other theoretical use case is that one could build openmp only, and then "send" the DeviceRTL to someone else. I doubt that it is practical.

Upsides of my proposal:

  • The OpenMP deviceRTL code has no (and likely won't ever have any) connection to the OpenMP host runtime in openmp/ but, at least for now, there are clear connections to code in offload/ (e.g., the global debug flags). This might change once we have a generic part; we should put effort into that first.
  • All device RTLs live together. They will share common code in generic, but also be similar in nature. Having them in one place helps people look around and see how it was solved in language "X", find and refactor common code, etc. One folder for all the GPU device code, one place to look, one place to put things. This is also more flexible: Let's say language X and Y want to share some code, and with this proposal, that is either in generic, or lives in top-level X or top-level Y. We could have it in DeviceRTL/XY-common instead.
  • We have a clear place for the device RTLs for future languages. OpenMP is special since it has a top-level directory, but the others do not, and I don't assume all of them will. All DeviceRTLs will depend on offload, so putting them into offload satisfies one of their dependencies, even if there is a second one, e.g., to openmp.

Now, one could argue DeviceRTLs should not be in offload but maybe compilerRT. Even then, I'd argue you want compiler-rt/DeviceRTL/{openmp,sycl,...} not compiler-rt/{openmp,sycl,...}/DeviceRTL.


[EDIT]

> Yeah, it's a little confusing because right now offload/ has a direct dependency on openmp so they're effectively the same project.

This is not true, and I believe we should avoid making such statements:

Offload depends on OpenMP (for now), but OpenMP is useful standalone.
Flang depends on MLIR, but MLIR is useful standalone.
...

Now, should Offload depend on OpenMP: No.
We should invest time to break that dependence, and this PR does not improve the situation. I mentioned this before, DeviceRTL.openmp.a (what is moved here), has no ties to openmp/ but only to offload/. Even if you split the generic parts out, moving the DeviceRTL openmp parts doesn't change the dependence situation at all.

@jhuber6
Contributor Author

jhuber6 commented Apr 23, 2025

> To make one thing clear early on: Standalone, this only introduces cost. There is no tangible benefit from this PR, but a CMake change that will break people. If this is done after other reorganizations have happened, e.g., a generic device RTL is created, this might change, though I am not sure about tangible benefits then either.

I'm assuming you mean that moving to openmp/ only introduces cost? This PR has a very tangible benefit of decoupling the offload runtime with the GPU runtime builds.

> Alternative Proposal:
>
> offload/DeviceRTL/generic
> offload/DeviceRTL/openmp
> offload/DeviceRTL/openacc
> offload/DeviceRTL/sycl
> offload/DeviceRTL/cuda
> ...
>
> Now DeviceRTL.openmp.a lives in .../openmp/ and we can guard building it with "OPENMP_IS_ENABLED".

As I understand, we already have a pretty strong tendency toward the former. We have right now flang-rt, compiler-rt, libclc, and openmp. If this is the direction that LLVM wants then surely we could make language-runtimes/flang/ etc in a similar fashion? I think it's more straightforward that the OpenMP language has its runtime in the OpenMP project.

@jdoerfert
Member

jdoerfert commented Apr 23, 2025

> I'm assuming you mean that moving to openmp/ only introduces cost?

Yes.

> This PR has a very tangible benefit of decoupling the offload runtime with the GPU runtime builds.

Please describe the usage scenario that benefits from this. Keep in mind that we seem to all agree on a generic GPU runtime inside of offload, which has to be split out of what we have right now. So, with this proposal, there will be a GPU runtime in offload and a GPU runtime in openmp, and ...

[EDIT] I was referring to the benefit of the code movement part, not of the separate GPU runtime build part, which can be achieved w/o any code movement at all.

jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request Apr 24, 2025
Summary:
Override the default linker in case the user is passing it separately.
This requires `lld` but it always did. This will be fixed *properly*
when llvm#136729 lands.
sylvestre pushed a commit that referenced this pull request Apr 25, 2025
Summary:
Override the default linker in case the user is passing it separately.
This requires `lld` but it always did. This will be fixed *properly*
when #136729 lands.

Fixes #136822
@jhuber6
Contributor Author

jhuber6 commented Apr 25, 2025

So, I'm assuming there's a reasonable consensus that splitting up the device and host builds is the right way to go. Right now the argument is whether this should live in openmp/ or offload/.

For historical context, this library used to live in openmp/ until around a year ago, so this isn't new ground. The motivation for moving it is simple; this library provides the OpenMP device runtime so we tell people to build the openmp/ runtime. This obviously isn't critically important, and I'll keep it in offload/ if it's absolutely necessary, but I'd like to get some other opinions.

One argument is that the code in offload/ has ABI dependencies on the interface, so they should stay together. That's true, but it's primarily just conforming to the code OpenMP emits and expects. The dependency chain should be that liboffload provides a generic API and libomptarget inherits from that. All of that ABI code could then be moved back to openmp/. That's not likely to happen soon, but there's no reason that should stop us from doing this now.

Future languages may want their own runtimes. HIP and CUDA have some kind of device runtime libdevice and the ROCm Device Libs currently, but these would probably live somewhere else anyway since they don't really need a runtime. The vast majority of these libraries is already covered by libm and libc. The current status quo is that different language runtimes get their own project, e.g. compiler-rt, flang-rt, libclc, and openmp. We already have openmp/ and it's where it used to live. It makes sense because it contains the runtime implementations of functions the compiler emits when you compile with -fopenmp.

Wanting to share code is somewhat compelling, but there's nothing stopping us from putting generic utility headers in offload/ and including those instead. We already have cross-project headers elsewhere, but I've been trying to move the basic stuff into <gpuintrin.h> for similar reasons; it covers most cases I can think of.

So, I think it should go back in openmp/ as with libomptarget. That makes offload/ a generic interface that languages inherit from to make their own language runtimes, which I think is how most people expect offload/ to work. I would like to hear some more opinions; if the majority thinks it should stay in offload/ then people will just need to type -DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES=offload instead and we can move on.
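
For reference, the minimum settings from the PR summary can also be written as a CMake cache file, in the style of `offload/cmake/caches/Offload.cmake`. This is a sketch rather than part of the patch, and the file name `DeviceRuntime.cmake` is hypothetical.

```cmake
# DeviceRuntime.cmake (hypothetical name): minimum settings from the PR summary.

# Build runtimes for the host ("default") plus both GPU triples.
set(LLVM_RUNTIME_TARGETS "default;amdgcn-amd-amdhsa;nvptx64-nvidia-cuda" CACHE STRING "")

# Enable the OpenMP device runtime for each GPU target.
set(RUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES "openmp" CACHE STRING "")
set(RUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES "openmp" CACHE STRING "")
```

A file like this would be passed at configure time with `cmake -C`. If the device runtime stayed in offload/, the two `openmp` values would presumably become `offload`, as the comment above suggests.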

@Meinersbur
Member

Now, should Offload depend on OpenMP: No. We should invest time to break that dependence, and this PR does not improve the situation. I mentioned this before, DeviceRTL.openmp.a (what is moved here), has no ties to openmp/ but only to offload/. Even if you split the generic parts out, moving the DeviceRTL openmp parts doesn't change the dependence situation at all.

Only because the OpenMP DeviceRTL duplicates definitions such as kmp_sched_t from openmp's kmp.h.

If breaking the dependence means copy & pasting shared definitions wholesale, then I am strongly against it. This increases the maintenance burden instead of decreasing it. If you know how to do it without that, please sketch out your plan.

Offload depends on OpenMP (for now), but OpenMP is useful standalone. Flang depends on MLIR, but MLIR is useful standalone. ...

This should not be about usefulness, but about component dependencies. A generic utility library should not contain code that can only be used with one specific project that uses the library, and it should not have knowledge of the dependent project's internal workings, even if that project is not strictly a dependency because its definitions are just duplicated.

jyli0116 pushed a commit to jyli0116/llvm-project that referenced this pull request Apr 28, 2025
Summary:
Override the default linker in case the user is passing it separately.
This requires `lld` but it always did. This will be fixed *properly*
when llvm#136729 lands.

Fixes llvm#136822
@tahonermann
Contributor

As I understand, we already have a pretty strong tendency toward the former. We have right now flang-rt, compiler-rt, libclc, and openmp.

My understanding (which might be incorrect) is that flang-rt and compiler-rt are host-only libraries, libclc is device-only, and openmp has both host and device components, with the location of the device-only component being the crux of this discussion. A policy of using top level directories for host-only RTs and the host portions of RTs that span host/device, and placing device-only RT libraries under offload, makes sense to me. However...

So, I think it should go back in openmp/ as with libomptarget. That makes offload/ a generic interface that languages inherit from to make their own language runtimes, which I think is how most people expect offload/ to work.

I find this argument compelling as well.

Perhaps it would make sense to keep offload generic and minimal and to co-locate the device RTs under a top level device-rt directory that contains openmp, openacc, cuda, etc...

@jhuber6
Contributor Author

jhuber6 commented Apr 28, 2025

As I understand, we already have a pretty strong tendency toward the former. We have right now flang-rt, compiler-rt, libclc, and openmp.

My understanding (which might be incorrect) is that flang-rt and compiler-rt are host-only libraries, libclc is device-only, and openmp has both host and device components, with the location of the device-only component being the crux of this discussion. A policy of using top level directories for host-only RTs and the host portions of RTs that span host/device, and placing device-only RT libraries under offload, makes sense to me. However...

I don't really like to make a distinction between 'host' and 'device' here. As shown by the libc project, we should be able to treat the GPU as just another target. OpenMP is a little special here because it does enforce different semantics on the host vs. device, but everything else is just some flavor of compiling some utility functions for that target. Wasn't OpenCL designed with execution on CPUs in mind as well? It's probably easier to think of just having some utility library that works correctly w/ cross-compiling.

jhuber6 added a commit to jhuber6/llvm-project that referenced this pull request May 2, 2025
Summary:
Another hacky fix done until
llvm#136729 lands. This time for
`-mcpu`.
@jdoerfert
Member

So, I think it should go back in openmp/ as with libomptarget. That makes offload/ a generic interface that languages inherit from to make their own language runtimes, which I think is how most people expect offload/ to work.

I find this argument compelling as well.

Perhaps it would make sense to keep offload generic and minimal and to co-locate the device RTs under a top level device-rt directory that contains openmp, openacc, cuda, etc...

This addresses one of my main concerns: spreading device runtimes all over the place or introducing N new top-level folders. I don't think we want either, but keeping the device code together in a new top-level device-rt directory is, for me, almost as good as having that device-rt folder live under offload. I don't see the benefit of it not being in offload, at least until we have device runtimes that work without offload, or at least have plans to have them. Moving it to openmp will open up the question of where to put the rest, hence my conceptual objection to it. Not to mention that device runtimes have more connection to one another, and to the offload infrastructure, than to their host runtime, at least for now. (Again, there is nothing in DeviceRTL.openmp.a that connects to the openmp folder/host code but, for now, various things that connect to the offload folder/host code.)

@jdoerfert
Member

FWIW, this PR contains two conceptual changes, and my objection + comments have all been targeting one of them: the code move.
Wrt. the second change, I support building the device runtimes per triple in a way that aligns more with cross-compiling other runtimes. I understand from @jhuber6 that he bundled them to avoid two cmake changes if both are merged, but that bundling is what, for now, stalls the second part.

llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request May 6, 2025
…h (#136754)

Summary:
This was accidentally kept in the old location when we moved to the
new `lib/<triple>/` location for the DeviceRTL. Move this to reduce the
delta with llvm/llvm-project#136729.
jhuber6 added a commit that referenced this pull request May 6, 2025
Summary:
Another hacky fix done until
#136729 lands. This time for
`-mcpu`.