Tags: apache/tvm-ffi
[OrcJIT] Arena JITLinkMemoryManager with GOTPCRELX fix (Linux) (#527)

## Summary

Adds an arena-based `JITLinkMemoryManager` that eliminates scattered-mmap relocation overflow in LLVM ORC JIT under ASLR / VA pressure ([LLVM #173269](llvm/llvm-project#173269)), plus a workaround for an x86_64 JITLink GOTPCRELX relaxation bug. Linux only; other platforms fall back to the default `InProcessMemoryManager`.

### Arena memory manager (`orcjit_arena_mm.{h,cc}`)

- Pre-reserves one contiguous VA region via `mmap(PROT_NONE | MAP_NORESERVE)` at session startup and bump-allocates from it, guaranteeing all JIT allocations stay within PC-relative range (±2 GB x86_64, ±4 GB AArch64).
- Default capacity: 4 GB (x86_64) / 8 GB (AArch64). On reservation failure (RLIMIT_AS, containers) the constructor halves down to a 256 MB floor.
- **Dual-pool split.** Arena is partitioned at a 2 MB-aligned midpoint into a non-exec pool (`r--`/`rw-`) and an exec pool (`r-x`). Exec segments pack tightly into whole 2 MB pages for contiguous r-x layout and TLB-friendly huge-page promotion. Both pools are capped so cross-pool Delta32 fixups always resolve inside ±2 GB.
- **Slab commit with THP.** Physical pages are committed in 2 MB slabs, matching Linux huge page size. `madvise(MADV_HUGEPAGE)` on the full reservation lets the kernel promote fully-faulted slabs to single TLB entries.
- **Overflow sections.** Known large absolute-only sections (`.nv_fatbin`) are routed to separate `mmap()` allocations outside the arena. Guarded by a two-phase check: name-based candidate selection, then edge validation that disqualifies any section targeted by a PC-relative reference.
- **Segment-lifetime handling.** `Finalize`-lifetime pages are freed at the end of `finalize()`; `Standard`-lifetime pages remain until `deallocate()`. Free list coalesces adjacent blocks for reuse.
- Decommit is deliberately a no-op: `ELFNixPlatform` deinitializers can still reference freed allocations during teardown. Physical pages return to the free list instead; all memory is reclaimed by `munmap` in the arena destructor.

### GOTPCRELX fix plugin (`orcjit_session.cc`)

- Works around LLVM JITLink's `optimizeGOTAndStubAccesses()` relaxing `call *foo@GOTPCREL(%rip)` → `addr32 call foo` but tagging the edge as absolute `Pointer32`. On non-PIE executables with symbols in the low 4 GB, this produces a garbage displacement → SIGSEGV during ORC-runtime teardown.
- `GOTPCRELXFixPlugin` runs as a `PreFixupPass` after relaxation and either converts to `BranchPCRel32` when the displacement fits, or reverts the relaxation (restores `ff 15`/`ff 25` opcodes, retargets the edge to the GOT entry with `PCRel32`).

### Configuration

`ExecutionSession(arena_size=...)` / `arena_size_bytes` C++ arg: `0` = arch default, `>0` = custom size, `<0` = disable arena. Linux-only; ignored on macOS/Windows where the arena is compiled out.

### Tests (`tests/test_arena.py`)

8 arena tests across C/C++/GCC/PIE variants:

- `test_arena_colocation`: objects stay within a small window.
- `test_arena_keeps_objects_close`: scatter baseline under VA blocker with arena enabled.
- `test_arena_hidden_symbol_with_blocker`: ADRP/PC32 cross-object calls resolve under VA pressure.
- `test_large_data_section`: 4 MB `.nv_fatbin` loads inside arena when references are absolute.
- `test_overflow_section_outside_arena`: `.nv_fatbin` routed to separate mmap, confirmed via address gap.
- `test_dso_handle_relocation_after_failed_materialization`: `__dso_handle` resolves after prior sessions leaked slabs.
- `test_dso_handle_delta32_with_arena` / `_overflow_without_arena`: `-fpie` GCC objects under 3 GB VA blocker: with arena → passes; without arena → Delta32 overflow.

All tests use a 16 MB arena and 256 MB–3 GB VA blockers, safe for CI.
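
The sizing rules (the `arena_size` values under Configuration, the per-architecture defaults, and the halving fallback to a 256 MB floor) can be modeled in a few lines. This is a minimal Python sketch, not the code in `orcjit_arena_mm.cc`: `resolve_arena_size`, `reserve_with_halving`, and the `try_reserve` callback are hypothetical names, and both clamping to the floor and returning `None` when the floor also fails are assumptions.

```python
from typing import Callable, Optional

MB = 1 << 20
GB = 1 << 30

# Defaults and floor as stated in the PR description.
DEFAULT_CAPACITY = {"x86_64": 4 * GB, "aarch64": 8 * GB}
FLOOR = 256 * MB


def resolve_arena_size(arena_size: int, arch: str) -> Optional[int]:
    """Map the documented arena_size values to a requested capacity."""
    if arena_size < 0:
        return None  # arena disabled
    if arena_size == 0:
        return DEFAULT_CAPACITY[arch]  # architecture default
    return arena_size  # custom size


def reserve_with_halving(
    requested: int, try_reserve: Callable[[int], bool]
) -> Optional[int]:
    """Halve the request after each failed reservation, down to FLOOR.

    try_reserve stands in for the real mmap call (hypothetical callback).
    Returns the reserved size, or None if nothing could be reserved
    (assumed behaviour; the PR text does not describe this case).
    """
    size = requested
    while True:
        if try_reserve(size):
            return size
        if size <= FLOOR:
            return None
        size = max(size // 2, FLOOR)
```

For example, a process that can only reserve up to 1 GB of address space would end up with a 1 GB arena after two failed attempts at 4 GB and 2 GB.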

## Test plan

- [x] All orcjit tests pass locally on Linux x86_64 and aarch64
- [ ] CI green on Linux x86_64, Linux aarch64, macOS arm64, Windows AMD64
- [x] Non-Linux platforms unaffected (arena compiled out under `#ifdef __linux__`)

---------

Co-authored-by: Yaxing Cai
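
The choice the GOTPCRELX fix plugin makes for each relaxed edge (convert to `BranchPCRel32` when the displacement fits, otherwise revert to the GOT-indirect form) comes down to a signed 32-bit range check. The Python sketch below illustrates only that check. `fix_relaxed_edge` and its return strings are hypothetical, and it assumes the usual x86_64 rule that a `rel32` displacement is measured from the end of the 4-byte displacement field.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def fix_relaxed_edge(fixup_addr: int, target_addr: int) -> str:
    """Decide how to repair a relaxed call/jmp edge.

    fixup_addr is the address of the 4-byte displacement field and
    target_addr is the final address of the callee. Both parameter
    names are illustrative, not taken from the plugin.
    """
    # rel32 is relative to the byte following the displacement field.
    disp = target_addr - (fixup_addr + 4)
    if INT32_MIN <= disp <= INT32_MAX:
        return "BranchPCRel32"  # keep the relaxed direct branch
    return "revert-to-GOT"  # restore the indirect form through the GOT
```

A JIT'd call site high in the address space that targets a symbol in the low 4 GB of a non-PIE executable falls outside the range, which is the case where the relaxation has to be reverted.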
[ABI] Add begin_index to TypeAttrColumn (#471)

This PR adds a begin_index field to TypeAttrColumn. With begin_index, a type attribute column can store only a narrow range of type indices. This is useful when a type attribute is restricted to a specific subscope whose objects are allocated contiguously, so the column can be optimized for space and locality. For now, TypeAttrColumn is accessed only in extra/cc, so the impact is limited. To be careful, begin_index is set to 0 for the next few versions and will migrate to nonzero size in 1.0 (so i64 platform size is compatible).
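
To illustrate how a begin_index lets a column cover only a narrow range of type indices, here is a small Python model. It is a sketch under assumptions, not the tvm-ffi ABI: `TypeAttrColumnModel` and its `get` method are hypothetical, and the real TypeAttrColumn is a C struct whose layout is not shown in this message.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class TypeAttrColumnModel:
    """Toy model: data[i] holds the attribute for type index begin_index + i."""

    data: List[Any]
    begin_index: int = 0  # 0 reproduces a column indexed directly by type index

    def get(self, type_index: int) -> Optional[Any]:
        # Shift the lookup into the column's local range.
        offset = type_index - self.begin_index
        if 0 <= offset < len(self.data):
            return self.data[offset]
        return None  # type index is outside the stored range
```

A column covering type indices 128 through 130 then needs three slots instead of 131, which is the space and locality benefit the PR describes.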
[FIX] Fix the error propagation in the case of tensor arguments (#409)

This PR fixes error propagation in the case of tensor arguments. The bug was previously hidden and revealed after a fix landed in 0.1.8, so it does not impact previous versions. Added a regression test to cover this case.
[CUDA] Isolate unified api to only in cubin launcher (#408)

This PR isolates the unified API so that it is local to the cubin launcher. Background: mixing the driver and runtime APIs is generally error-prone. The unified API switch was mainly meant for the cubin launcher on a narrow range of CUDA versions (roughly 12.8 to 13.0). However, the most generic macros, such as TVM_FFI_CHECK_CUDA_ERROR, should be specific to the runtime API. We should revisit whether to simply deprecate driver API usage for better maintainability.

---------

Co-authored-by: Junru Shao