Tags: minnellf/iree
[Codegen] Add VerifyPipelineConstraints pass for compilation info (iree-org#23878)

The pass evaluates pipeline constraints against the actual lowering config to verify that the tuner constraints are consistent with configs the compiler produces. In the future, full tuner constraints can also validate the configs picked by the configuration selection heuristics.

Gated behind `--iree-codegen-experimental-verify-pipeline-constraints` (off by default, implies `--iree-codegen-add-tuner-attributes`).

Phase ordering: `SelectLoweringStrategy` -> `InsertSMTConstraints` -> `Verify`, since we need the root ops to be decided already. Constraint ops are erased regardless of outcome since they only serve verification and tuning.

Implemented with a simple evaluator; I considered using folders or dataflow instead, but I like the simplicity of the evaluator.

This is not fully generic yet and only works for llvmgpu constraints. In the future, we'd need extra interfaces to make it fully generic without assuming the exact dialects used for compilation info attrs.

Issue: iree-org#23535

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
[DispatchCreation] Dynamic selection of split reduction target tile size for outer reductions (iree-org#23869)

This PR updates the split reduction logic for outer reductions so that the target tile size is selected in proportion to the total reduction work to be carried out. This should help select tiles dynamically based on the problem size to improve parallelism. The proposed decision formula is derived from experiments on MI355.

Signed-off-by: Yash Deshpande <[email protected]>
[Codegen][CAPI] Fix C API assertion for GPU pipeline attributes in TranslationInfoAttr (iree-org#23868)

Context: several changes landed on the IREE side:
- iree-org#23590
- iree-org#23687
- iree-org#23816

These caused a tuner CI error: https://github.com/nod-ai/amd-shark-ai/actions/runs/23314739415/job/67811065632?pr=2865#step:8:135

This PR fixes the C API assertion in `TranslationInfoAttr.get()` to accept `PipelineAttr` in addition to `DispatchLoweringPassPipelineAttr`.

Assisted-by: [Claude Code](https://claude.ai/code)
Signed-off-by: Bangtian Liu <[email protected]>
Refactor proactor pool, add frontier-carrying signals, and fix shared infra. (iree-org#23804)

Proactor pool runner factory:
- Extract thread management from proactor_pool into an injectable runner_factory callback. The pool creates proactors and delegates poll-driving to a factory, enabling platforms without C threads (wasm, embedded/RTOS) to use the pool with their own poll mechanisms.
- The thread-based runner moves to a new proactor_thread_runner target.
- _options_default() selects the thread runner on native platforms and no runner on platforms without threads, so all existing callsites work without changes.

Frontier-carrying signals:
- Add optional frontier parameter to iree_hal_semaphore_signal() and iree_hal_semaphore_list_signal() for cross-device causal ordering.
- Update all HAL drivers (local_task, local_sync, Vulkan, CUDA, HIP, AMDGPU, Metal) and CTS tests to pass the frontier parameter.
- Add FIFO wait elision to semaphore submission tests.

Shared fixes:
- Fix iree_call_once no-op on IREE_SYNCHRONIZATION_DISABLE_UNSAFE platforms (wasm, bare-metal RISC-V). The single-threaded fallback never called the init function.
- Guard file_transfer.c queue_copy fast path with DEVICE_VISIBLE check so HOST_LOCAL-only buffers fall through to the streaming path.
- Add heap_buffer_wrap fallback in memory_file when device import fails.
- Split threaded semaphore CTS tests into semaphore_thread_test.cc so platforms without C threading can run single-threaded tests.

Co-authored-by: Claude <[email protected]>
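The `iree_call_once` fix listed under "Shared fixes" can be illustrated with a minimal sketch of a single-threaded call-once fallback. This is not IREE's implementation: `demo_once_flag_t`, `demo_call_once`, and `demo_init` are hypothetical names chosen for illustration. The point is only that a fallback without locks or atomics must still invoke the init function on first use instead of compiling down to a no-op.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical stand-in for a once flag; not IREE's actual type.
typedef struct demo_once_flag_t {
  bool has_run;
} demo_once_flag_t;

#define DEMO_ONCE_FLAG_INIT {false}

// Single-threaded call-once: no atomics or locks are required because there
// is only one thread, but the init function must still run on first use.
void demo_call_once(demo_once_flag_t* flag, void (*init_fn)(void)) {
  assert(flag != NULL && init_fn != NULL);
  if (!flag->has_run) {
    flag->has_run = true;
    init_fn();
  }
}

// Example init function that counts how many times it has been invoked.
int demo_init_count = 0;
void demo_init(void) { ++demo_init_count; }
```

Calling `demo_call_once(&flag, demo_init)` repeatedly on the same flag runs `demo_init` only on the first call; later calls return without doing anything.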
Simplify RISC-V QEMU configuration and make CPU flags configurable (iree-org#23777)

This change makes two improvements to RISC-V testing infrastructure:

1. Allow toolchain files to control QEMU CPU parameters via the RISCV_QEMU_CPU_FLAGS variable, making it easier to customize CPU features for testing.
2. Unify QEMU binary configuration by replacing QEMU_RV64_BIN and QEMU_RV32_BIN with a single QEMU_BIN variable. Since the same build environment cannot support both riscv64 and riscv32 simultaneously, having separate variables is unnecessary.

Changes:
- Add RISCV_QEMU_CPU_FLAGS to linux_riscv32.cmake and linux_riscv64.cmake
- Pass QEMU_CPU_FLAGS environment variable to tests
- Update run_riscv_test.sh to use QEMU_BIN and QEMU_CPU_FLAGS
- Update GitHub workflow to use QEMU_BIN instead of QEMU_RV64_BIN

Signed-off-by: Han-Kuan Chen <[email protected]>
Unify HAL semaphores on async infrastructure. (iree-org#23695)

Many breaking(ish) API changes here: this is the grand unification of the HAL semaphore mechanism that unlocks heterogeneous execution and remoting, so it's worth it :) Buffers will be next (those still have issues and will need some iree_async_region_t work) but at least now synchronization functions the same across all layers of the stack and the kernel/devices have the ability to elide device-side waits thanks to the frontiers (not yet wired, but coming soon). The task system was also substantially cleaned up and now no longer has a poller thread (or a 63-concurrent-waiter limit). Future changes will continue to optimize the task system to avoid additional thread hops to reduce CPU latency.

Note, CUDA is untested here beyond building, as I don't have access to a CUDA machine right now. Full reports from anyone with access to one would be appreciated, as would figuring out how we get a CUDA CI :)

---

This branch replaces IREE's legacy semaphore system, where each HAL driver implemented its own timeline semaphore from scratch, with a single `iree_async_semaphore_t` that every driver embeds. The HAL semaphore becomes a thin shell around a shared, well-tested core. The result: 388 files changed, **-14,000 net lines**, 115 files deleted. The codebase gets simpler and more correct at the same time.

### What this does

**Unified type system.** Every HAL semaphore now embeds an `iree_async_semaphore_t` at offset 0 (toll-free bridge). The async semaphore owns the timeline value, failure status, timepoint list, and optional frontier. Driver-specific semaphore types (CUDA events, Vulkan timeline semaphores, Metal shared events, software semaphores) become wrappers that add only their hardware-specific signaling on top. This means timeline tracking, failure propagation, multi-wait, and timepoint dispatch are written once and shared by all eight backends.
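The offset-0 embedding described above can be sketched in plain C. This is a minimal illustration of the general "first member" pattern, not IREE's actual definitions: every `demo_*` type, field, and function below is hypothetical. Because the shared core is the first member of the wrapper, a pointer to the wrapper and a pointer to the core refer to the same address, so shared code can operate on any driver's semaphore through the core type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical shared core: owns the timeline value.
typedef struct demo_async_semaphore_t {
  uint64_t timeline_value;
} demo_async_semaphore_t;

// Hypothetical driver wrapper: the core comes first, followed by whatever
// hardware-specific state the driver needs.
typedef struct demo_driver_semaphore_t {
  demo_async_semaphore_t async;  // Must remain the first member.
  int native_handle;             // Placeholder for driver-specific state.
} demo_driver_semaphore_t;

// Compile-time check that the core really sits at offset 0.
_Static_assert(offsetof(demo_driver_semaphore_t, async) == 0,
               "async core must be the first member");

// Shared code written once against the core type.
uint64_t demo_async_semaphore_query(const demo_async_semaphore_t* sem) {
  return sem->timeline_value;
}

void demo_async_semaphore_signal(demo_async_semaphore_t* sem, uint64_t value) {
  assert(value >= sem->timeline_value);  // Timelines only move forward.
  sem->timeline_value = value;
}

// Casts in both directions are valid because both pointers share an address.
demo_async_semaphore_t* demo_driver_semaphore_base(
    demo_driver_semaphore_t* sem) {
  return (demo_async_semaphore_t*)sem;
}

demo_driver_semaphore_t* demo_driver_semaphore_cast(
    demo_async_semaphore_t* base) {
  return (demo_driver_semaphore_t*)base;
}
```

The `_Static_assert` makes the layout requirement explicit, so reordering the wrapper's members fails at compile time rather than corrupting memory at run time.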
**Centralized multi-wait.** Semaphore wait-any and wait-all are now implemented once in `iree_hal_semaphore_wait_list`, using the proactor's native wait primitives. The old approach, where each driver reimplemented multi-wait with varying degrees of correctness, is gone. The Vulkan driver gets a dedicated completion watcher thread that bridges Vulkan's `vkWaitSemaphores` into the async semaphore's timepoint system, so Vulkan waits participate in the same unified infrastructure as everyone else.

**Proactor integration.** The async proactor (io_uring / IOCP / kqueue) is now wired through device creation into every semaphore. This is the foundation for event-driven scheduling: instead of polling or busy-waiting on GPU completion, semaphore timepoints can be delivered through the OS's native async I/O mechanism. A proactor pool manages per-thread proactor instances so that device creation doesn't require callers to think about I/O infrastructure.

**Deletion cascade.** With the async semaphore as the single source of truth, a large amount of legacy infrastructure becomes dead code:
- `semaphore_base.h/c` and the bridge timepoint API (the old compatibility layer between driver semaphores and the async system)
- `iree_loop_t` and `loop_sync` (moved to VM where it's only needed for inline module execution)
- `wait_handle`, `event_pool`, `wait_primitive` (replaced by async primitives)
- The entire `experimental/web/` and `experimental/webgpu/` trees (see note below)
- `iree_hal_wait_flags_t` (replaced by a clean three-tier `ACTIVE` / `YIELD` / `BLOCK` model)

### Emscripten / WebGPU

The old web and WebGPU samples are deleted in this branch. They were built on the emscripten loop, which was built on `wait_handle` and the old synchronous wait infrastructure, all of which is now gone. This is intentional, not collateral damage. When emscripten support comes back, it will be built on the proactor system, which is a far more natural fit.
The browser's event loop is fundamentally a proactor: you submit work (fetch, GPU dispatch, timer) and get called back when it completes. The old emscripten loop fought against this by trying to impose a synchronous polling model on an inherently callback-driven environment. A proactor backend for emscripten will work *with* the browser's execution model (`postMessage` for cross-worker signaling, `requestAnimationFrame` for frame pacing, GPU completion callbacks for timeline advancement) the same way io_uring works with the Linux kernel and IOCP works with Windows.

### Why this matters

The old semaphore system was the main obstacle to several things we want to do:

**Correct error propagation.** Every driver had its own failure handling, and most of them got edge cases wrong. With a single implementation, failure status propagates correctly through the entire pipeline: GPU error → driver callback → async semaphore → fence → user.

**Remote execution.** The frontier/axis system (already landed) needs semaphores that can be signaled from network events, not just GPU completions. The unified async semaphore makes this trivial: a network-backed semaphore is just an `iree_async_semaphore_t` with no hardware wrapper. Without unification, we'd need to either special-case remote semaphores in every driver's wait path or build a second parallel wait infrastructure.

**New driver development.** The AMDGPU driver (in progress) benefits directly: instead of building semaphore infrastructure from scratch, it embeds the async semaphore and implements only the HSA-specific signaling. Same for any future driver.

**Event-driven scheduling.** The proactor integration means we can move toward a model where the runtime reacts to completions rather than polling for them. This is necessary for efficient multi-device orchestration and for keeping CPU utilization low during GPU-bound workloads.
### API changes

| Area | Old | New | Notes |
|------|-----|-----|-------|
| **Semaphore embedding** | Each driver defines its own semaphore struct | Embed `iree_async_semaphore_t` at offset 0 | Toll-free bridge: `(iree_hal_semaphore_t*)async_sem` is valid |
| **Semaphore vtable** | Flat `iree_hal_semaphore_vtable_t` | Embeds `iree_async_semaphore_vtable_t` at offset 0 | `query()` returns `uint64_t` directly; `signal()` takes frontier; `fail()` → non-virtual with `on_fail()` hook |
| **Device creation** | No params struct | `iree_hal_device_create_params_t` with `proactor_pool` | All `iree_hal_driver_create_device*` signatures change |
| **Wait flags** | `iree_hal_wait_flags_t` | `iree_async_wait_flags_t` | Three-tier: `NONE` (block), `YIELD` (brief spin), `ACTIVE` (full spin) |
| **Wait mode** | `iree_hal_wait_mode_t` | `iree_async_wait_mode_t` | Same values, moved to async layer |
| **Wait primitives** | `iree_wait_primitive_t` | `iree_async_primitive_t` | Proactor-native; `WAIT_PRIMITIVE` → `ASYNC_PRIMITIVE` in external timepoint types |
| **Wait source** | `iree_wait_source_ctl_fn_t` dispatch | `iree_wait_source_resolve_fn_t` | Single function: sync when `callback=NULL`, async when non-NULL |
| **Loop** | `iree/base/loop.h`, `iree_loop_*` | `iree/vm/loop.h`, `iree_vm_loop_*` | Mechanical rename; only needed for VM inline execution now |
| **Multi-wait** | Per-driver `vtable->wait_semaphores()` | `iree_async_semaphore_multi_wait()` in base layer | Drivers no longer implement this |
| **Executable cache** | `create_executable_cache(dev, id, loop, out)` | `create_executable_cache(dev, id, out)` | `iree_loop_t` parameter removed |

**Deleted**:
- `iree/hal/utils/semaphore_base.h`: bridge timepoint API between driver semaphores and async system
- `iree/base/wait_handle.h`, `iree/base/event_pool.h`: replaced by async/proactor infrastructure
- `experimental/web/`, `experimental/webgpu/`: see Emscripten note above

### Review and verification

The branch was developed
iteratively and then rebased into a clean 23-commit sequence. A four-arc cross-validated review (multiple models with manual verification) covered the full diff. All changes verified under ASAN on Linux (`//runtime/...`), plus Windows (IOCP) and macOS (kqueue) for the platform-specific async backends. CUDA and HIP drivers confirmed to compile.

ci-extra: all

---------

Co-authored-by: Claude <[email protected]>
[Codegen][ROCm] Fix vector distribution for transposed outputs (iree-org#23791)

Layer norm-style dispatches with a multi-output generic that has a transposed output used to crash with `failed to distribute` on a proprietary model. Teach `shouldAttachLoweringConfig` to recognize non-identity output indexing maps so the op gets a `lowering_config` and proper `to_layout` anchors.

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
[Codegen][ROCm] Fix crash in complex matmul configuration logic (iree-org#23790)

`getContractionHeuristicSeeds` called `problem.aType.getIntOrFloatBitWidth()` unconditionally, but `aType` can be `complex<f32>` from complex batch matmul dispatches. Route complex contractions to the SIMT-based `setContractConfig` fallback instead.

This fixes a crash on a proprietary model. I checked the numerics against NumPy and CPU, and ROCm is as accurate as CPU.

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
[Codegen] Apply bounds to subgroup_id (iree-org#23768)

Since we're fixing to start using subgroup_id more often with PCF, apply bounds to it based on the subgroup size (or sizes) and the number of threads in the workgroup.

Also extend subgroup_size handling to account for known subgroup sizes instead of giving up completely when there isn't a fixed choice made yet.

Also fix up some double-spaces in test attributes.

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
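As a rough illustration of how a bound on subgroup_id might follow from the workgroup thread count and a set of candidate subgroup sizes, here is a minimal sketch. It is an assumption about the arithmetic, not the logic the PR implements, and the `demo_*` names are hypothetical: when the subgroup size is not yet fixed, the smallest candidate size yields the largest possible subgroup count, which gives a conservative exclusive upper bound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Integer ceiling division for positive divisors.
uint32_t demo_ceil_div(uint32_t numerator, uint32_t denominator) {
  assert(denominator > 0);
  return (numerator + denominator - 1) / denominator;
}

// Returns an exclusive upper bound on subgroup_id, assuming (hypothetically)
// that subgroup_id ranges over [0, ceil(threads / subgroup_size)).
// With several candidate sizes, the smallest size gives the widest range.
uint32_t demo_subgroup_id_upper_bound(uint32_t workgroup_threads,
                                      const uint32_t* subgroup_sizes,
                                      size_t subgroup_size_count) {
  assert(subgroup_sizes != NULL && subgroup_size_count > 0);
  uint32_t smallest = subgroup_sizes[0];
  for (size_t i = 1; i < subgroup_size_count; ++i) {
    if (subgroup_sizes[i] < smallest) smallest = subgroup_sizes[i];
  }
  return demo_ceil_div(workgroup_threads, smallest);
}
```

For a 256-thread workgroup with candidate subgroup sizes 32 and 64, this sketch gives a bound of 8; with a single fixed size of 64 it gives 4.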