iree-3.11.0rc20260322

[Codegen] Add VerifyPipelineConstraints pass for compilation info (iree-org#23878)

The pass evaluates pipeline constraints against the actual lowering
config to verify that the tuner constraints are consistent with configs
the compiler produces. In the future, full tuner constraints can also
validate the configs picked by the configuration selection heuristics.

Gated behind `--iree-codegen-experimental-verify-pipeline-constraints`
(off by default, implies `--iree-codegen-add-tuner-attributes`). Phase
ordering: `SelectLoweringStrategy` -> `InsertSMTConstraints` ->
`Verify`, since we need the root ops to be decided already. Constraint
ops are erased regardless of outcome since they only serve verification
and tuning.

Implemented with a simple evaluator; I considered using folders or
dataflow instead, but I like the simplicity of the evaluator.
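
As a rough illustration of the evaluator approach, the sketch below walks a small constraint expression tree and evaluates it against concrete config values. The tuple encoding, the config keys (`workgroup_size`, `subgroup_size`, `tile_m`, `tile_n`), and the function names are all hypothetical; the real pass works on the constraint ops inserted by `InsertSMTConstraints`, which this does not model.

```python
import operator

# Hypothetical encoding: nested tuples (op, lhs, rhs). Leaves are ints or
# strings naming a config knob. This is not the real IR.
_OPS = {
    "add": operator.add,
    "mul": operator.mul,
    "mod": operator.mod,
    "eq": operator.eq,
    "le": operator.le,
    "and": lambda a, b: bool(a) and bool(b),
}


def evaluate(expr, config):
    """Recursively evaluate an expression against concrete config values."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return config[expr]
    op, lhs, rhs = expr
    return _OPS[op](evaluate(lhs, config), evaluate(rhs, config))


def verify(constraints, config):
    """Return the constraints that the config violates."""
    return [c for c in constraints if not evaluate(c, config)]


constraints = [
    ("eq", ("mod", "workgroup_size", "subgroup_size"), 0),
    ("le", ("mul", "tile_m", "tile_n"), 16384),
]
good = {"workgroup_size": 256, "subgroup_size": 64, "tile_m": 128, "tile_n": 128}
bad = dict(good, workgroup_size=96)  # 96 is not a multiple of 64.
```

Here `verify(constraints, good)` returns an empty list, while `verify(constraints, bad)` returns the divisibility constraint. A direct recursive walk like this needs no fixpoint iteration or rewrite patterns, which is the kind of simplicity the message refers to.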

This is not fully generic yet and only works for llvmgpu constraints. In
the future, we'd need extra interfaces to make it fully generic without
assuming the exact dialects used for compilation info attrs.

Issue: iree-org#23535

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

Mar 21, 2026
662a633
zip
tar.gz

iree-3.11.0rc20260321

[DispatchCreation] Dynamic selection of split reduction target tile size for outer reductions (iree-org#23869)

This PR updates the split-reduction logic for outer reductions so that
the target tile size is chosen in proportion to the total reduction work
to be carried out. Selecting tiles dynamically from the problem size
should improve parallelism. The proposed decision formula is derived
from experiments on MI355.

Signed-off-by: Yash Deshpande <[email protected]>

Mar 21, 2026
1fc3d61
zip
tar.gz

iree-3.11.0rc20260320

[Codegen][CAPI] Fix C API assertion for GPU pipeline attributes in TranslationInfoAttr (iree-org#23868)

Context: several changes landed on the IREE side:
- iree-org#23590
- iree-org#23687
- iree-org#23816

and the tuner CI reports this error:
https://github.com/nod-ai/amd-shark-ai/actions/runs/23314739415/job/67811065632?pr=2865#step:8:135

This PR fixes the C API assertion in `TranslationInfoAttr.get()` to
accept `PipelineAttr` in addition to `DispatchLoweringPassPipelineAttr`.

Assisted-by: [Claude Code](https://claude.ai/code)

Signed-off-by: Bangtian Liu <[email protected]>

Mar 20, 2026
0b5ea9f
zip
tar.gz

v3.11.0

Version 3.11.0 release.

Mar 19, 2026
e4a3b04
zip
tar.gz

iree-3.11.0rc20260319

Refactor proactor pool, add frontier-carrying signals, and fix shared infra. (iree-org#23804)

Proactor pool runner factory:
- Extract thread management from proactor_pool into an injectable
runner_factory callback. The pool creates proactors and delegates
poll-driving to a factory, enabling platforms without C threads (wasm,
embedded/RTOS) to use the pool with their own poll mechanisms.
- The thread-based runner moves to a new proactor_thread_runner target.
- _options_default() selects the thread runner on native platforms and
no runner on platforms without threads, so all existing callsites work
without changes.
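
The split between the pool and its runner can be sketched as follows. This is a minimal model, assuming a proactor is just something with a `poll()` method; the class names (`Proactor`, `ThreadRunner`, `ProactorPool`) and the `runner_factory` call signature are hypothetical stand-ins, not the actual C API.

```python
import threading


class Proactor:
    """Stand-in for a proactor: poll() drains queued completions."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = []
        self.completed = []

    def submit(self, item):
        with self._lock:
            self._pending.append(item)

    def poll(self):
        with self._lock:
            drained, self._pending = self._pending, []
        self.completed.extend(drained)
        return len(drained)


class ThreadRunner:
    """Drives poll() from a dedicated thread until stopped."""

    def __init__(self, proactor):
        self._proactor = proactor
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            self._proactor.poll()
            self._stop.wait(0.001)

    def stop(self):
        self._stop.set()
        self._thread.join()
        self._proactor.poll()  # Final drain after the thread has exited.


class ProactorPool:
    """Creates proactors and delegates poll-driving to runner_factory."""

    def __init__(self, count, runner_factory=None):
        self.proactors = [Proactor() for _ in range(count)]
        # With no factory, the embedder must call poll() itself, e.g. from a
        # host event loop on a platform without threads.
        self.runners = (
            [runner_factory(p) for p in self.proactors] if runner_factory else []
        )

    def shutdown(self):
        for runner in self.runners:
            runner.stop()


# Threaded platform: the injected runner drives polling.
threaded = ProactorPool(2, runner_factory=ThreadRunner)
threaded.proactors[0].submit("a")
threaded.shutdown()

# Thread-less platform: no runner, the embedder polls manually.
manual = ProactorPool(1)
manual.proactors[0].submit("b")
drained = manual.proactors[0].poll()
```

The pool itself never touches a thread API in this sketch; only the injected runner does, which is what lets a thread-less platform reuse the same pool.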

Frontier-carrying signals:
- Add optional frontier parameter to iree_hal_semaphore_signal() and
iree_hal_semaphore_list_signal() for cross-device causal ordering.
- Update all HAL drivers (local_task, local_sync, Vulkan, CUDA, HIP,
AMDGPU, Metal) and CTS tests to pass the frontier parameter.
- Add FIFO wait elision to semaphore submission tests.

Shared fixes:
- Fix iree_call_once no-op on IREE_SYNCHRONIZATION_DISABLE_UNSAFE
platforms (wasm, bare-metal RISC-V). The single-threaded fallback never
called the init function.
- Guard file_transfer.c queue_copy fast path with DEVICE_VISIBLE check
so HOST_LOCAL-only buffers fall through to the streaming path.
- Add heap_buffer_wrap fallback in memory_file when device import fails.
- Split threaded semaphore CTS tests into semaphore_thread_test.cc so
platforms without C threading can run single-threaded tests.
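
The `iree_call_once` fix above concerns a class of bug that is easy to reproduce in miniature. The sketch below is a hypothetical Python analogue, not the C implementation: the `synchronization_disabled` argument stands in for the `IREE_SYNCHRONIZATION_DISABLE_UNSAFE` build configuration, and `OnceFlag` is an invented helper.

```python
import threading


class OnceFlag:
    """Tracks whether the init function has already run."""

    def __init__(self):
        self.done = False
        self.lock = threading.Lock()


def call_once(flag, init_fn, synchronization_disabled=False):
    """Run init_fn exactly once per flag."""
    if synchronization_disabled:
        # Single-threaded fallback: locking can be skipped, but the init
        # function must still run on the first call. Returning here without
        # calling it would be the no-op bug described above.
        if not flag.done:
            init_fn()
            flag.done = True
        return
    with flag.lock:
        if not flag.done:
            init_fn()
            flag.done = True


calls = []
flag = OnceFlag()
for _ in range(3):
    call_once(flag, lambda: calls.append("init"), synchronization_disabled=True)
```

After the loop, `calls` holds a single `"init"` entry: the fallback path still initializes, and only once.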

Co-authored-by: Claude <[email protected]>

Mar 19, 2026
98078db
zip
tar.gz

iree-3.11.0rc20260318

Simplify RISC-V QEMU configuration and make CPU flags configurable (iree-org#23777)

This change makes two improvements to RISC-V testing infrastructure:

1. Allow toolchain files to control QEMU CPU parameters via
RISCV_QEMU_CPU_FLAGS variable, making it easier to customize CPU
features for testing.

2. Unify QEMU binary configuration by replacing QEMU_RV64_BIN and
QEMU_RV32_BIN with a single QEMU_BIN variable. Since the same build
environment cannot support both riscv64 and riscv32 simultaneously,
having separate variables is unnecessary.

Changes:
- Add RISCV_QEMU_CPU_FLAGS to linux_riscv32.cmake and
linux_riscv64.cmake
- Pass QEMU_CPU_FLAGS environment variable to tests
- Update run_riscv_test.sh to use QEMU_BIN and QEMU_CPU_FLAGS
- Update GitHub workflow to use QEMU_BIN instead of QEMU_RV64_BIN

Signed-off-by: Han-Kuan Chen <[email protected]>

Mar 17, 2026
49d88b4
zip
tar.gz

iree-3.11.0rc20260317

Unify HAL semaphores on async infrastructure. (iree-org#23695)

Many breaking(ish) API changes here: this is the grand unification of
the HAL semaphore mechanism that unlocks heterogeneous execution and
remoting, so it's worth it :) Buffers will be next (those still have
issues and will need some iree_async_region_t work) but at least now
synchronization functions the same across all layers of the stack and
the kernel/devices have the ability to elide device-side waits thanks to
the frontiers (not yet wired, but coming soon). The task system was also
substantially cleaned up and now no longer has a poller thread (or a
63-concurrent-waiter limit). Future changes will continue to optimize
the task system to avoid additional thread hops to reduce CPU latency.

Note: CUDA is untested here beyond building, as I don't have access to a
CUDA machine right now. Full reports from anyone with access to one
would be appreciated. Or help figuring out how we get a CUDA CI :)

---

This branch replaces IREE's legacy semaphore system — where each HAL
driver implemented its own timeline semaphore from scratch — with a
single `iree_async_semaphore_t` that every driver embeds. The HAL
semaphore becomes a thin shell around a shared, well-tested core.

The result: 388 files changed, **-14,000 net lines**, 115 files deleted.
The codebase gets simpler and more correct at the same time.

### What this does

**Unified type system.** Every HAL semaphore now embeds an
`iree_async_semaphore_t` at offset 0 (toll-free bridge). The async
semaphore owns the timeline value, failure status, timepoint list, and
optional frontier. Driver-specific semaphore types (CUDA events, Vulkan
timeline semaphores, Metal shared events, software semaphores) become
wrappers that add only their hardware-specific signaling on top. This
means timeline tracking, failure propagation, multi-wait, and timepoint
dispatch are written once and shared by all eight backends.
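
To make the division of labor concrete, here is a small single-threaded model of a shared timeline core with a thin driver wrapper on top. It is only a sketch: the class and method names are invented, inheritance stands in for the C offset-0 embedding, and the frontier, locking, and reference counting of the real `iree_async_semaphore_t` are left out.

```python
class TimelineSemaphore:
    """Shared core: timeline value, failure status, and timepoint list."""

    def __init__(self, initial_value=0):
        self.value = initial_value
        self.failure = None
        self._timepoints = []  # Pending (target_value, callback) pairs.

    def acquire_timepoint(self, target_value, callback):
        """Call callback(failure_or_None) once the target is reached or on failure."""
        if self.failure is not None or self.value >= target_value:
            callback(self.failure)
        else:
            self._timepoints.append((target_value, callback))

    def signal(self, new_value):
        if new_value <= self.value:
            raise ValueError("timeline values must increase monotonically")
        self.value = new_value
        self._dispatch()

    def fail(self, reason):
        self.failure = reason
        self._dispatch()

    def _dispatch(self):
        ready, waiting = [], []
        for target, callback in self._timepoints:
            reached = self.failure is not None or self.value >= target
            (ready if reached else waiting).append((target, callback))
        self._timepoints = waiting
        for _, callback in ready:
            callback(self.failure)


class HardwareSemaphore(TimelineSemaphore):
    """Driver wrapper: adds only its device-specific signaling on top."""

    def __init__(self):
        super().__init__()
        self.device_signals = []  # Stand-in for a hardware signal operation.

    def signal(self, new_value):
        self.device_signals.append(new_value)
        super().signal(new_value)


events = []
sem = HardwareSemaphore()
sem.acquire_timepoint(2, lambda failure: events.append(("t2", failure)))
sem.acquire_timepoint(5, lambda failure: events.append(("t5", failure)))
sem.signal(2)
sem.fail("device lost")
```

The first timepoint resolves normally when the timeline reaches 2; the second never reaches 5 but is still woken with the failure, so waiters are not left hanging when a device fails. The wrapper contributes nothing to that logic beyond recording its own device signal.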

**Centralized multi-wait.** Semaphore wait-any and wait-all are now
implemented once in `iree_hal_semaphore_wait_list`, using the proactor's
native wait primitives. The old approach — where each driver
reimplemented multi-wait with varying degrees of correctness — is gone.
The Vulkan driver gets a dedicated completion watcher thread that
bridges Vulkan's `vkWaitSemaphores` into the async semaphore's timepoint
system, so Vulkan waits participate in the same unified infrastructure
as everyone else.
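
A single shared multi-wait can be built purely on timepoints, independent of any driver. The following is a callback-based sketch under that assumption; `Sem`, `on_reach`, and `multi_wait` are hypothetical names, and the real implementation blocks on proactor wait primitives rather than invoking a Python callback.

```python
class Sem:
    """Tiny single-threaded timeline semaphore, just enough for the example."""

    def __init__(self):
        self.value = 0
        self._timepoints = []

    def on_reach(self, target, callback):
        if self.value >= target:
            callback()
        else:
            self._timepoints.append((target, callback))

    def signal(self, value):
        self.value = value
        ready = [cb for t, cb in self._timepoints if t <= value]
        self._timepoints = [(t, cb) for t, cb in self._timepoints if t > value]
        for cb in ready:
            cb()


def multi_wait(entries, mode, on_done):
    """Fire on_done once when any ("any") or all ("all") entries are reached.

    entries is a list of (semaphore, target_value) pairs.
    """
    if not entries:
        on_done()
        return
    needed = 1 if mode == "any" else len(entries)
    state = {"hits": 0, "fired": False}

    def hit():
        state["hits"] += 1
        if state["hits"] >= needed and not state["fired"]:
            state["fired"] = True
            on_done()

    for sem, target in entries:
        sem.on_reach(target, hit)


a, b = Sem(), Sem()
log = []
multi_wait([(a, 1), (b, 1)], "any", lambda: log.append("any"))
multi_wait([(a, 1), (b, 1)], "all", lambda: log.append("all"))
a.signal(1)  # Satisfies the wait-any.
b.signal(1)  # Completes the wait-all.
```

Because `multi_wait` only relies on the timepoint interface, any semaphore that exposes it participates, regardless of which backend signals it.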

**Proactor integration.** The async proactor (io_uring / IOCP / kqueue)
is now wired through device creation into every semaphore. This is the
foundation for event-driven scheduling: instead of polling or
busy-waiting on GPU completion, semaphore timepoints can be delivered
through the OS's native async I/O mechanism. A proactor pool manages
per-thread proactor instances so that device creation doesn't require
callers to think about I/O infrastructure.

**Deletion cascade.** With the async semaphore as the single source of
truth, a large amount of legacy infrastructure becomes dead code:
- `semaphore_base.h/c` and the bridge timepoint API (the old
compatibility layer between driver semaphores and the async system)
- `iree_loop_t` and `loop_sync` (moved to VM where it's only needed for
inline module execution)
- `wait_handle`, `event_pool`, `wait_primitive` (replaced by async
primitives)
- The entire `experimental/web/` and `experimental/webgpu/` trees (see
note below)
- `iree_hal_wait_flags_t` (replaced by a clean three-tier `ACTIVE` /
`YIELD` / `BLOCK` model)

### Emscripten / WebGPU

The old web and WebGPU samples are deleted in this branch. They were
built on the emscripten loop, which was built on `wait_handle` and the
old synchronous wait infrastructure — all of which is now gone.

This is intentional, not collateral damage. When emscripten support
comes back, it will be built on the proactor system, which is a far more
natural fit. The browser's event loop is fundamentally a proactor: you
submit work (fetch, GPU dispatch, timer) and get called back when it
completes. The old emscripten loop fought against this by trying to
impose a synchronous polling model on an inherently callback-driven
environment. A proactor backend for emscripten will work *with* the
browser's execution model — `postMessage` for cross-worker signaling,
`requestAnimationFrame` for frame pacing, GPU completion callbacks for
timeline advancement — the same way io_uring works with the Linux kernel
and IOCP works with Windows.

### Why this matters

The old semaphore system was the main obstacle to several things we want
to do:

**Correct error propagation.** Every driver had its own failure
handling, and most of them got edge cases wrong. With a single
implementation, failure status propagates correctly through the entire
pipeline: GPU error → driver callback → async semaphore → fence → user.

**Remote execution.** The frontier/axis system (already landed) needs
semaphores that can be signaled from network events, not just GPU
completions. The unified async semaphore makes this trivial — a
network-backed semaphore is just an `iree_async_semaphore_t` with no
hardware wrapper. Without unification, we'd need to either special-case
remote semaphores in every driver's wait path or build a second parallel
wait infrastructure.

**New driver development.** The AMDGPU driver (in progress) benefits
directly: instead of building semaphore infrastructure from scratch, it
embeds the async semaphore and implements only the HSA-specific
signaling. Same for any future driver.

**Event-driven scheduling.** The proactor integration means we can move
toward a model where the runtime reacts to completions rather than
polling for them. This is necessary for efficient multi-device
orchestration and for keeping CPU utilization low during GPU-bound
workloads.

### API changes

| Area | Old | New | Notes |
|------|-----|-----|-------|
| **Semaphore embedding** | Each driver defines its own semaphore struct | Embed `iree_async_semaphore_t` at offset 0 | Toll-free bridge: `(iree_hal_semaphore_t*)async_sem` is valid |
| **Semaphore vtable** | Flat `iree_hal_semaphore_vtable_t` | Embeds `iree_async_semaphore_vtable_t` at offset 0 | `query()` returns `uint64_t` directly; `signal()` takes frontier; `fail()` → non-virtual with `on_fail()` hook |
| **Device creation** | No params struct | `iree_hal_device_create_params_t` with `proactor_pool` | All `iree_hal_driver_create_device*` signatures change |
| **Wait flags** | `iree_hal_wait_flags_t` | `iree_async_wait_flags_t` | Three-tier: `NONE` (block), `YIELD` (brief spin), `ACTIVE` (full spin) |
| **Wait mode** | `iree_hal_wait_mode_t` | `iree_async_wait_mode_t` | Same values, moved to async layer |
| **Wait primitives** | `iree_wait_primitive_t` | `iree_async_primitive_t` | Proactor-native; `WAIT_PRIMITIVE` → `ASYNC_PRIMITIVE` in external timepoint types |
| **Wait source** | `iree_wait_source_ctl_fn_t` dispatch | `iree_wait_source_resolve_fn_t` | Single function: sync when `callback=NULL`, async when non-NULL |
| **Loop** | `iree/base/loop.h`, `iree_loop_*` | `iree/vm/loop.h`, `iree_vm_loop_*` | Mechanical rename; only needed for VM inline execution now |
| **Multi-wait** | Per-driver `vtable->wait_semaphores()` | `iree_async_semaphore_multi_wait()` in base layer | Drivers no longer implement this |
| **Executable cache** | `create_executable_cache(dev, id, loop, out)` | `create_executable_cache(dev, id, out)` | `iree_loop_t` parameter removed |

**Deleted**:
- `iree/hal/utils/semaphore_base.h` — bridge timepoint API between
driver semaphores and async system
- `iree/base/wait_handle.h`, `iree/base/event_pool.h` — replaced by
async/proactor infrastructure
- `experimental/web/`, `experimental/webgpu/` — see Emscripten note
above

### Review and verification

The branch was developed iteratively and then rebased into a clean
23-commit sequence. A four-arc cross-validated review (multiple models
with manual verification) covered the full diff. All changes verified
under ASAN on Linux (`//runtime/...`), plus Windows (IOCP) and macOS
(kqueue) for the platform-specific async backends. CUDA and HIP drivers
confirmed to compile.

ci-extra: all

---------

Co-authored-by: Claude <[email protected]>

Mar 17, 2026
d7f5aba
zip
tar.gz

iree-3.11.0rc20260316

[Codegen][ROCm] Fix vector distribution for transposed outputs (iree-org#23791)

Layer norm-style dispatches with a multi-output generic that has a
transposed output used to crash with `failed to distribute` on a
proprietary model.

Teach `shouldAttachLoweringConfig` to recognize non-identity output
indexing maps so the op gets a `lowering_config` and proper `to_layout`
anchors.

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

Mar 15, 2026
e4a3b04
zip
tar.gz

iree-3.11.0rc20260315

[Codegen][ROCm] Fix crash in complex matmul configuration logic (iree-org#23790)

`getContractionHeuristicSeeds` called
`problem.aType.getIntOrFloatBitWidth()` unconditionally, but aType can
be `complex<f32>` from complex batch matmul dispatches.

Route complex contractions to the SIMT-based setContractConfig fallback
instead. This fixes a crash on a proprietary model.

I checked the numerics against NumPy and the CPU backend; ROCm is as
accurate as CPU.

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>

Mar 15, 2026
c578130
zip
tar.gz

iree-3.11.0rc20260314

[Codegen] Apply bounds to subgroup_id (iree-org#23768)

Since we're about to start using subgroup_id more often with PCF, apply
bounds to it based on the subgroup size (or sizes) and the number of
threads in the workgroup. Also extend subgroup_size handling to account
for known subgroup sizes instead of giving up completely when there
isn't a fixed choice made yet.
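
The arithmetic behind such a bound can be sketched as below. This assumes subgroups are formed from a linearized workgroup of `workgroup_threads` threads and that the bound is the ceiling of threads over subgroup size; the function name and the exact rule are illustrative guesses, not taken from the pass.

```python
def subgroup_id_upper_bound(workgroup_threads, subgroup_sizes):
    """Exclusive upper bound on subgroup_id.

    When several subgroup sizes are still possible, the smallest one yields
    the most subgroups, so it gives a bound that is safe for every candidate.
    """
    if workgroup_threads <= 0 or not subgroup_sizes:
        raise ValueError("need a positive thread count and at least one size")
    smallest = min(subgroup_sizes)
    return -(-workgroup_threads // smallest)  # Ceiling division.


fixed = subgroup_id_upper_bound(256, [64])
undecided = subgroup_id_upper_bound(256, [32, 64])
```

With a fixed subgroup size of 64 and 256 threads, `subgroup_id` lies in `[0, 4)`. If both 32 and 64 are still possible, the conservative range widens to `[0, 8)`, which is still useful information compared with no bound at all.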

Also fix up some double-spaces in test attributes.

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>

Mar 14, 2026
94a0427
zip
tar.gz


Tags: minnellf/iree