
Move execute-parent kernels into session registry#8482

Merged
gatesn merged 17 commits into develop from ngates/session-parent-kernels on Jun 18, 2026

gatesn (Contributor) commented on Jun 17, 2026

Removes the ScalarFnVTable::execute_parent function in favor of purely session-registered kernels.

We already had some kernels on the session, and some on the scalar function. This unifies everything on the session.
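To make the unification concrete, here is a minimal, hypothetical Rust sketch of a session-owned registry that replaces a per-vtable execute_parent method. The EncodingId alias, Array struct, and KernelRegistry type are illustrative assumptions, not Vortex's API; only the method name register_execute_parent_kernel and the (parent, child) keying are taken from this PR's discussion.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an encoding id (the real type is richer).
type EncodingId = &'static str;

#[derive(Debug, Clone, PartialEq)]
struct Array {
    encoding: EncodingId,
    len: usize,
}

/// A parent kernel returns None to decline, letting the executor fall back.
type ExecuteParentFn = fn(&Array, &Array) -> Option<Array>;

/// Hypothetical session-owned registry: kernels live here rather than on each
/// encoding's vtable, keyed by (parent encoding id, child encoding id).
#[derive(Default)]
struct KernelRegistry {
    execute_parent: HashMap<(EncodingId, EncodingId), Vec<ExecuteParentFn>>,
}

impl KernelRegistry {
    /// Simplified signature; the real method's parameters differ.
    fn register_execute_parent_kernel(
        &mut self,
        parent: EncodingId,
        child: EncodingId,
        kernel: ExecuteParentFn,
    ) {
        self.execute_parent.entry((parent, child)).or_default().push(kernel);
    }

    /// Looks up kernels for this (parent, child) pair; the first Some wins.
    fn execute_parent(&self, parent: &Array, child: &Array) -> Option<Array> {
        self.execute_parent
            .get(&(parent.encoding, child.encoding))?
            .iter()
            .find_map(|kernel| kernel(parent, child))
    }
}
```

An unregistered (parent, child) pair yields None, so the executor can fall back to another strategy such as materializing the child.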

I may well follow up by removing ScalarFnVTable::execute also in favor of these kernels.

A definite follow-up is to make the structure of our plugin crates uniform. Currently kernels sometimes live in compute modules and sometimes in kernel or vtable modules.

Closes #8401

codspeed-hq (bot) commented on Jun 17, 2026

Merging this PR will improve performance by 22.35%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 91 improved benchmarks
❌ 12 regressed benchmarks
✅ 1478 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | take_10k_random | 197.7 µs | 253.7 µs | -22.08% |
| Simulation | take_10k_contiguous | 217.6 µs | 274.4 µs | -20.68% |
| Simulation | patched_take_10k_contiguous_patches | 232.1 µs | 288.8 µs | -19.64% |
| Simulation | patched_take_10k_random | 244.5 µs | 301.2 µs | -18.83% |
| Simulation | encode_varbin[(1000, 2)] | 239.6 µs | 294.6 µs | -18.67% |
| Simulation | chunked_varbinview_canonical_into[(1000, 10)] | 162.7 µs | 191.1 µs | -14.86% |
| Simulation | varbinview_large | 112.6 µs | 130.8 µs | -13.93% |
| Simulation | decompress_rd[f64, (100000, 0.01)] | 890.1 µs | 1,020.4 µs | -12.77% |
| Simulation | decompress_rd[f64, (100000, 0.1)] | 890.2 µs | 1,020.4 µs | -12.76% |
| Simulation | bench_many_codes_few_values[1024] | 467.5 µs | 526.3 µs | -11.17% |
| Simulation | and_true_constant | 14.9 µs | 16.8 µs | -11.1% |
| Simulation | or_false_constant | 14.9 µs | 16.8 µs | -10.92% |
| Simulation | search_index_mixed_out_of_range | 255.9 µs | 78.9 µs | ×3.2 |
| Simulation | search_index_above_max | 255.7 µs | 78.9 µs | ×3.2 |
| Simulation | search_index_below_min | 255.7 µs | 78.9 µs | ×3.2 |
| Simulation | search_index_full_range_random | 255.9 µs | 79 µs | ×3.2 |
| Simulation | search_index_in_range | 256.1 µs | 79.2 µs | ×3.2 |
| Simulation | chunked_dict_primitive_into_canonical[u32, (1000, 10, 10)] | 185.4 µs | 87.5 µs | ×2.1 |
| Simulation | chunked_bool_canonical_into[(10, 1000)] | 751.4 µs | 492.8 µs | +52.47% |
| Simulation | chunked_opt_bool_canonical_into[(10, 1000)] | 866.6 µs | 572.3 µs | +51.42% |
| ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
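The Efficiency column appears to be BASE / HEAD - 1, so a negative value means HEAD is slower, and the ×N rows show the raw ratio BASE / HEAD. This is inferred from the numbers in the table, not taken from CodSpeed documentation; a quick check in Rust:

```rust
/// Efficiency as the table appears to compute it (an inference, not a
/// documented formula): BASE / HEAD - 1. Positive means HEAD is faster.
fn efficiency(base_us: f64, head_us: f64) -> f64 {
    base_us / head_us - 1.0
}

/// Speedup ratio, matching the "×3.2"-style rows.
fn speedup(base_us: f64, head_us: f64) -> f64 {
    base_us / head_us
}
```

For example, take_10k_random gives 197.7 / 253.7 - 1 ≈ -0.2207, matching the -22.08% row up to rounding of the displayed timings, and 255.9 / 78.9 ≈ 3.24 matches ×3.2.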

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing ngates/session-parent-kernels (b7a1dd9) with develop (aef6307)


gatesn force-pushed the ngates/session-parent-kernels branch from 61b7163 to 78d6bfd on June 17, 2026 20:43
gatesn added the changelog/break (A breaking API change) label on Jun 17, 2026
gatesn added 4 commits June 17, 2026 17:02
Signed-off-by: "Nicholas Gates" <[email protected]>
Signed-off-by: Nicholas Gates <[email protected]>
Initialize direct encoding benchmark sessions with their crate-level kernel registrations so they exercise the session execute-parent path instead of fallback materialization.

Cache the ArrayKernels handle in ExecutionCtx to avoid repeated session lookups while trying execute-parent kernels.

Signed-off-by: "Nicholas Gates" <[email protected]>
Signed-off-by: Nicholas Gates <[email protected]>
gatesn force-pushed the ngates/session-parent-kernels branch from 78d6bfd to 644e00d on June 17, 2026 21:08
gatesn and others added 9 commits June 17, 2026 17:55
Signed-off-by: Nicholas Gates <[email protected]>
Signed-off-by: "Nicholas Gates" <[email protected]>
…ent-kernels

Signed-off-by: Nicholas Gates <[email protected]>

# Conflicts:
#	encodings/bytebool/src/kernel.rs
#	vortex-array/src/arrays/bool/vtable/kernel.rs
Avoid the per-execution parent kernel cache now that ExecutionCtx already holds the session snapshot, and make patch index lookup use the primitive fast path directly.

Signed-off-by: "Nicholas Gates" <[email protected]>
Store vortex-array's built-in kernel registry on ArraySession so sessions that install the built-in array encodings also get their matching execute-parent kernels.

Signed-off-by: "Nicholas Gates" <[email protected]>
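The ArraySession::default wiring described in this commit can be sketched with std types: a kernel handle whose clones share one cell, so registering built-ins through a temporary session that holds a clone still lands in the stored registry. Kernels, ArraySession, and initialize_builtins here are hypothetical stand-ins; the automated review describes the real registry as an ArcSwapMap, which this sketch approximates with Arc<RwLock<_>>.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// (parent encoding id, child encoding id)
type Key = (&'static str, &'static str);

/// Hypothetical stand-in for ArrayKernels. Cloning copies the Arc, so every
/// clone reads and writes the same underlying map.
#[derive(Clone, Default)]
struct Kernels {
    cell: Arc<RwLock<HashMap<Key, Vec<&'static str>>>>,
}

impl Kernels {
    fn register(&self, key: Key, kernel: &'static str) {
        self.cell.write().unwrap().entry(key).or_default().push(kernel);
    }

    fn get(&self, key: Key) -> Vec<&'static str> {
        self.cell.read().unwrap().get(&key).cloned().unwrap_or_default()
    }
}

struct ArraySession {
    kernels: Kernels,
}

/// Hypothetical crate-level `initialize` that registers built-in kernels.
fn initialize_builtins(session: &ArraySession) {
    session.kernels.register(("slice", "bool"), "builtin_slice_bool");
}

impl Default for ArraySession {
    fn default() -> Self {
        let this = ArraySession { kernels: Kernels::default() };
        // A temporary session holding a clone of the handle: because clones
        // share the cell, these registrations land in `this.kernels`.
        let tmp = ArraySession { kernels: this.kernels.clone() };
        initialize_builtins(&tmp);
        this
    }
}
```

If Kernels::clone deep-copied the map instead, the built-ins would be registered into a throwaway copy and the returned session would start empty.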
gatesn marked this pull request as ready for review on June 18, 2026 17:45
gatesn requested a review from a team on June 18, 2026 17:45
Signed-off-by: Nicholas Gates <[email protected]>
gatesn (Contributor, Author) commented on Jun 18, 2026

@claude review it

github-actions (bot, Contributor) commented on Jun 18, 2026

Claude finished @gatesn's task in 8m 1s.


PR Review: Move execute-parent kernels into session registry

I traced the refactor end-to-end: the removed VTable::execute_parent / DynArrayData::execute_parent, the new register_execute_parent_kernel + DynExecuteParentKernel registry, the executor dispatch, the ArraySession/ArrayKernels wiring, and per-encoding migrations. Overall this is a clean, well-executed unification and the mechanism is sound. A few findings below, none blocking.


Findings

1. Stale developer docs reference the removed ParentKernelSet (docs; should fix in this PR)

docs/developer-guide/internals/execution.md:118 still says:

An encoding declares its parent kernels in a ParentKernelSet, specifying which parent types each kernel handles via a Matcher

ParentKernelSet is fully deleted (no Rust references remain). Encodings now register with session.kernels().register_execute_parent_kernel(parent_id, Child, Kernel). The doc example is also out of date: it shows array: &V::Array, but the trait (vortex-array/src/kernel.rs:41) now takes array: ArrayView<'_, V>. Since this is the canonical "how to add a parent kernel" reference, it is now actively misleading.
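A hypothetical sketch of the kernel shape the review describes: the kernel receives a borrowed view of the child (standing in for ArrayView<'_, V>) and declares a Parent matcher whose try_match is re-checked before the kernel runs, so a non-matching parent (for example a non-comparison Binary) falls through. Apart from try_match and the ArrayView name, every type here is illustrative, and this ArrayView drops the encoding type parameter V for brevity.

```rust
type EncodingId = &'static str;

#[derive(Debug, Clone, PartialEq)]
enum Op {
    Eq,
    Add,
}

#[derive(Debug, Clone, PartialEq)]
struct Array {
    encoding: EncodingId,
    op: Option<Op>,
    len: usize,
}

/// Stand-in for ArrayView<'_, V>: a borrowed view of the child array.
struct ArrayView<'a> {
    array: &'a Array,
}

/// Stand-in for K::Parent: decides whether a parent under the registered
/// encoding id is one this kernel actually handles.
trait ParentMatcher {
    fn try_match(parent: &Array) -> bool;
}

struct Comparison;
impl ParentMatcher for Comparison {
    fn try_match(parent: &Array) -> bool {
        matches!(parent.op, Some(Op::Eq))
    }
}

trait ExecuteParentKernel {
    type Parent: ParentMatcher;
    /// Must preserve the parent's logical length (and dtype, elided here).
    fn execute_parent(&self, array: ArrayView<'_>, parent: &Array) -> Option<Array>;
}

/// Hypothetical kernel fusing a comparison over a constant child.
struct ConstantCompare;
impl ExecuteParentKernel for ConstantCompare {
    type Parent = Comparison;
    fn execute_parent(&self, array: ArrayView<'_>, parent: &Array) -> Option<Array> {
        if array.array.encoding != "constant" {
            return None; // decline; the executor tries the next option
        }
        Some(Array { encoding: "bool", op: None, len: parent.len })
    }
}

/// What a registered wrapper does: re-check the matcher, then run the kernel.
fn run_registered<K: ExecuteParentKernel>(kernel: &K, child: &Array, parent: &Array) -> Option<Array> {
    if !K::Parent::try_match(parent) {
        return None;
    }
    kernel.execute_parent(ArrayView { array: child }, parent)
}
```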

2. Lost unconditional debug invariant on execute_parent results (low, debug-only)

The deleted DynArrayData::execute_parent had a debug_assertions guard that checked every result, regardless of where it appeared in the execution tree:

```rust
vortex_ensure!(result.len() == parent.len(), "Executed parent canonical length mismatch");
vortex_ensure!(result.dtype() == parent.dtype(), "Executed parent canonical dtype mismatch");
```

The new dispatch (executor.rs:561 execute_parent_for_child) returns the kernel result directly with no such check. The doc comments on both ExecuteParentFn (optimizer/kernels.rs:89) and the kernel trait still state "Implementations must preserve the parent's logical length and dtype", but nothing enforces this at the rewrite site anymore. Downstream pop_frame (executor.rs:514) and finalize_done (executor.rs:542) give partial coverage when a grandparent frame exists, but not for a root parent or for every Step-2b (own-child) rewrite. A buggy kernel that returns the wrong length or dtype would now slip past the point where it used to be caught. This is cheap to restore: re-add the cfg!(debug_assertions) check inside execute_parent_for_child after a kernel returns Some.
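One way to restore the check, sketched with simplified hypothetical Array and DType types: compare length and dtype only when cfg!(debug_assertions) is true, immediately after the kernel returns Some. The real code would use vortex_ensure! and a VortexResult; this sketch returns a String error instead, and its execute_parent_for_child signature is an assumption.

```rust
#[derive(Debug, Clone, PartialEq)]
enum DType {
    Bool,
    I32,
}

#[derive(Debug, Clone, PartialEq)]
struct Array {
    len: usize,
    dtype: DType,
}

/// Hypothetical stand-in for the executor's execute_parent_for_child.
fn execute_parent_for_child(
    parent: &Array,
    kernel: impl Fn(&Array) -> Option<Array>,
) -> Result<Option<Array>, String> {
    let Some(result) = kernel(parent) else {
        return Ok(None); // kernel declined; nothing to validate
    };
    // Debug-only invariant: a parent kernel must preserve length and dtype.
    if cfg!(debug_assertions) {
        if result.len != parent.len {
            return Err("Executed parent canonical length mismatch".into());
        }
        if result.dtype != parent.dtype {
            return Err("Executed parent canonical dtype mismatch".into());
        }
    }
    Ok(Some(result))
}
```

Because cfg!(debug_assertions) is a compile-time constant, release builds compile the check away.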

3. Precedence change for execute-parent overrides (behavioral note, low impact)

Previously the executor tried session-registered execute_parent plugins before the encoding's static PARENT_KERNELS, so a later registration could take precedence over a built-in. Now built-in kernels are themselves registered (at ArraySession::default / encoding initialize time) and additional registrations are appended (register_execute_parent_kernel → ArcSwapMap::push → extend), so they are tried after built-ins for the same (parent, child) key. This matches the intentional edit to the module doc (which dropped "or kernel" from the precedence claim for execute), so it appears deliberate. However, extension authors who relied on overriding a built-in fused execution path can no longer do so through registration order. It is worth adding a one-line note to the register_execute_parent_kernel rustdoc saying that it cannot override built-ins for an existing key.
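The precedence change follows from append-only registration combined with first-match dispatch. A minimal sketch (hypothetical Registry type and kernel functions, not Vortex's API) shows that an extension registered after a built-in for the same key only runs when the built-in declines:

```rust
use std::collections::HashMap;

/// (parent encoding id, child encoding id)
type Key = (&'static str, &'static str);
/// Hypothetical kernel: returns Some(label) if it handles an input of `len` rows.
type Kernel = fn(usize) -> Option<&'static str>;

#[derive(Default)]
struct Registry {
    map: HashMap<Key, Vec<Kernel>>,
}

impl Registry {
    /// Appends to the per-key list, as the review describes for push/extend.
    fn register(&mut self, key: Key, kernel: Kernel) {
        self.map.entry(key).or_default().push(kernel);
    }

    /// Tries kernels in registration order; the first Some wins.
    fn dispatch(&self, key: Key, len: usize) -> Option<&'static str> {
        self.map.get(&key)?.iter().find_map(|kernel| kernel(len))
    }
}

fn builtin(_len: usize) -> Option<&'static str> {
    Some("builtin")
}

fn extension(_len: usize) -> Option<&'static str> {
    Some("extension")
}

/// A built-in that only handles small inputs and declines otherwise.
fn builtin_small_only(len: usize) -> Option<&'static str> {
    (len < 1024).then_some("builtin")
}
```

Restoring override semantics would require either prepending user registrations or a separate, higher-priority list consulted first.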


Verified (no action needed)

  • Migration completeness: every encoding that previously exposed execute_parent/PARENT_KERNELS now both registers via register_execute_parent_kernel and wires kernel::initialize from its crate initialize (checked alp, alp_rd, bytebool, datetime-parts, decimal-byte-parts, onpair, fastlanes {bitpacking, for, rle}, fsst, parquet-variant, runend, sequence, sparse, zigzag, plus all built-in arrays via arrays::initialize).
  • Registry wiring is correct: ArraySession::default registers built-ins into this.kernels through a temp session holding this.kernels.clone(); ArcSwapMap::clone shares the same Arc<ArcSwap> cell (confirmed by arc_swap_map::clone_shares_the_same_cell and the new array_session_default_registers_builtin_kernels test), so registrations land in the stored registry rather than a throwaway copy.
  • Keying is correct: dispatch uses (parent.encoding_id(), child.encoding_id()), reusing the pre-existing reduce-parent scheme. Scalar-fn arrays report their function id as encoding_id() (ScalarFnArrayPlugin::id → self.0.id()), so registrations under Cast.id()/Binary.id()/Filter.id()/etc. resolve as intended. The RegisteredExecuteParentKernel still re-checks K::Parent::try_match(parent), so a non-matching op (e.g. a non-comparison Binary) correctly falls through.
  • Ordering preserved: per-key registration order matches the old ParentKernelSet arrays (e.g. RunEnd Dict: TakeExecuteAdaptor then RunEndTakeFrom; parquet-variant: Filter, Slice, Take, VariantGet).
  • Snapshot timing: ExecutionCtx::new snapshots the execute-parent kernels once per ctx, which is safe because registration happens at session-build time.
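The snapshot-timing argument can be illustrated with a copy-on-write table: the execution context clones an Arc once at construction, so registrations made afterwards are invisible to that context but visible to new ones. Session, KernelTable, and ExecutionCtx here are simplified hypothetical stand-ins, and RwLock<Arc<_>> approximates what ArcSwap provides without a lock.

```rust
use std::sync::{Arc, RwLock};

#[derive(Debug, Default, Clone, PartialEq)]
struct KernelTable {
    kernels: Vec<&'static str>,
}

#[derive(Default)]
struct Session {
    table: RwLock<Arc<KernelTable>>,
}

impl Session {
    /// Copy-on-write registration: build a new table and swap it in.
    fn register(&self, name: &'static str) {
        let mut slot = self.table.write().unwrap();
        let mut next = (**slot).clone();
        next.kernels.push(name);
        *slot = Arc::new(next);
    }
}

struct ExecutionCtx {
    kernels: Arc<KernelTable>,
}

impl ExecutionCtx {
    /// Takes one snapshot; kernel dispatch during execution reads only this.
    fn new(session: &Session) -> Self {
        let kernels = Arc::clone(&*session.table.read().unwrap());
        ExecutionCtx { kernels }
    }
}
```

This is why the snapshot is only safe if registration finishes at session-build time: a kernel registered after a context is created would be silently ignored by that context.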

Performance

The snapshot-per-ctx approach plus reduced dispatch overhead is reflected in CodSpeed's net +24.5%. The 7 flagged regressions (bitpacking take ~ -16–18%, encode_varbin -20%) are on the Walltime/hosted-runner instrument that CodSpeed itself warns is inconsistent — worth a glance but likely measurement noise rather than a real cost of this change.

Checks not run

This is a read-only review; I did not run cargo build/clippy/tests (CodSpeed confirms the branch compiles and runs). The findings above are static-analysis based.

gatesn merged commit ed69077 into develop on Jun 18, 2026
68 checks passed
gatesn deleted the ngates/session-parent-kernels branch on June 18, 2026 19:45

Labels

changelog/break A breaking API change


2 participants