
joseph-isaacs · 2026-06-11T14:19:00Z

Summary

Closes: #000

Testing

Evaluate `prefix%` and `%needle%` LIKE patterns directly on OnPair compressed code streams, mirroring the FSST DFA pushdown. Each u16 code is lifted to a byte-level DFA transition (KMP for contains, linear for prefix) by feeding its dictionary token's bytes through the byte table; scanning a row's codes is then one table lookup per code and is exactly equivalent to byte-level matching over the decompressed row.

OnPair has no escape code (the trainer always emits all 256 single-byte tokens), so the DFA is strictly simpler than FSST's: no escape sentinel and no escape table. Unsupported pattern shapes (`_`, suffix, ILIKE, needles beyond the u8 state space) return None and fall back to decompression.

Wires `LikeExecuteAdaptor(OnPair)` into the parent kernel set. Adds unit tests plus a randomised cross-check against ground-truth starts_with / contains over 600 rows and 14 needles.

Signed-off-by: Joe Isaacs <[email protected]>
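To make the code-lifting step concrete, here is a minimal sketch of the `%needle%` case. It is not the PR's implementation: `ContainsCodeDfa`, `kmp_byte_table`, and the `tokens` dictionary representation (a `Vec<Vec<u8>>` indexed by code) are hypothetical names chosen for illustration, and the table is built densely for every code.

```rust
/// Code-level DFA for `%needle%` (hypothetical type, not the PR's).
struct ContainsCodeDfa {
    table: Vec<u8>, // table[state * n_codes + code] = next state
    n_codes: usize,
    accept: u8,
}

/// Byte-level KMP automaton: `t[state * 256 + byte]` is the next state.
/// State `needle.len()` is accepting and absorbing. Requires 1 <= len <= 255.
fn kmp_byte_table(needle: &[u8]) -> Vec<u8> {
    let m = needle.len();
    let mut t = vec![0u8; (m + 1) * 256];
    t[needle[0] as usize] = 1;
    let mut x = 0usize; // restart state: where the automaton would be after a mismatch
    for s in 1..m {
        for b in 0..256 {
            t[s * 256 + b] = t[x * 256 + b]; // mismatch: behave like the restart state
        }
        t[s * 256 + needle[s] as usize] = (s + 1) as u8; // match: advance
        x = t[x * 256 + needle[s] as usize] as usize;
    }
    for b in 0..256 {
        t[m * 256 + b] = m as u8; // once matched, stay matched
    }
    t
}

impl ContainsCodeDfa {
    /// Returns None for needles this sketch does not support (empty, or too
    /// long for u8 states), which a caller would treat as "fall back".
    fn new(needle: &[u8], tokens: &[Vec<u8>]) -> Option<Self> {
        if needle.is_empty() || needle.len() > 255 {
            return None;
        }
        let n_states = needle.len() + 1;
        let n_codes = tokens.len();
        let bytes = kmp_byte_table(needle);
        let mut table = vec![0u8; n_states * n_codes];
        for (code, token) in tokens.iter().enumerate() {
            for start in 0..n_states {
                // Lift the code: feed its token bytes through the byte table.
                let mut s = start as u8;
                for &b in token {
                    s = bytes[s as usize * 256 + b as usize];
                }
                table[start * n_codes + code] = s;
            }
        }
        Some(Self { table, n_codes, accept: needle.len() as u8 })
    }

    /// One table lookup per code; returns as soon as the accept state is hit.
    fn matches(&self, codes: &[u16]) -> bool {
        let mut s = 0u8;
        for &c in codes {
            s = self.table[s as usize * self.n_codes + c as usize];
            if s == self.accept {
                return true;
            }
        }
        false
    }
}
```

Because the accept state is absorbing in the byte table, a token that contains the end of the needle in its middle still lands on accept, which is what keeps the per-code scan equivalent to matching over the decompressed bytes.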

Add a divan microbenchmark comparing the compressed-domain LIKE pushdown against the decompress-and-match fallback on a 200k-row OnPair-encoded URL column. On this corpus the pushdown is ~1.9-2.2x faster for prefix and ~2.4-3.3x for contains.

Two benchmark-enablement knobs:

- `VORTEX_ONPAIR_LIKE_PUSHDOWN=0` forces the OnPair LikeKernel to decline (fall back to decompression), so the same binary can A/B the pushdown end-to-end without a rebuild. Read once.
- `CLICKBENCH_PARTITIONS=N` caps how many ClickBench shards are fetched and queried, for local/iterative runs (the full suite still defaults to 100).

Signed-off-by: Joe Isaacs <[email protected]>
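A read-once environment knob of this kind can be sketched with `std::sync::OnceLock`. The function names below are hypothetical, and treating every value other than `0` (including an unset variable) as "enabled" is an assumption; the commit message only states that `0` forces the kernel to decline.

```rust
use std::sync::OnceLock;

/// Assumption: only an explicit "0" disables the pushdown; unset or any
/// other value leaves it enabled.
fn parse_pushdown_flag(value: Option<&str>) -> bool {
    !matches!(value, Some("0"))
}

/// Reads `VORTEX_ONPAIR_LIKE_PUSHDOWN` on first call and caches the answer,
/// so later changes to the environment do not affect a running process.
fn like_pushdown_enabled() -> bool {
    static ENABLED: OnceLock<bool> = OnceLock::new();
    *ENABLED.get_or_init(|| {
        let value = std::env::var("VORTEX_ONPAIR_LIKE_PUSHDOWN").ok();
        parse_pushdown_flag(value.as_deref())
    })
}
```

Keeping the parsing in a separate pure function makes the policy testable without mutating the process environment.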

Select the DFA variant once in `OnPairMatcher::scan_to_bitbuf` instead of re-matching the matcher enum per row through a closure, mark the concrete `FlatContainsDfa`/`FlatPrefixDfa::matches` `#[inline]`, and walk row offsets with a running cursor. This lets the row scan monomorphise and inline the DFA step. Controlled microbench (same machine, back-to-back): contains pushdown ~1.16-1.26x faster (e.g. %bonprix% 1.84ms -> 1.46ms), prefix marginally faster.

Also add an instrumented characterization test proving where the pushdown actually fires through the execution engine: bare OnPair and Dict(OnPair) both route the predicate to the kernel, but Dict(Shared(OnPair)) -- the shape a dict-encoded column takes when read back from a multi-chunk file -- does not, because `Shared` has no parent-reduce forwarding and canonicalizes (decompresses) instead. This is why the compressed-domain LIKE pushdown does not move end-to-end ClickBench/TPC-H numbers, and it affects FSST identically.

Signed-off-by: Joe Isaacs <[email protected]>
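The dispatch-once pattern described here can be sketched as follows. All names are hypothetical, the two "DFAs" are toy stand-ins (they compare raw codes rather than walking a transition table), and the output is a `Vec<bool>` rather than a bit buffer; the point is only the shape of the loop.

```rust
/// Minimal interface for a per-row matcher over a code stream (hypothetical).
trait CodeDfa {
    fn matches(&self, codes: &[u16]) -> bool;
}

/// Toy stand-in for a contains DFA: accepts if any code equals `target`.
struct ToyContains { target: u16 }
/// Toy stand-in for a prefix DFA: accepts if the first code equals `target`.
struct ToyPrefix { target: u16 }

impl CodeDfa for ToyContains {
    #[inline]
    fn matches(&self, codes: &[u16]) -> bool { codes.contains(&self.target) }
}
impl CodeDfa for ToyPrefix {
    #[inline]
    fn matches(&self, codes: &[u16]) -> bool { codes.first() == Some(&self.target) }
}

enum Matcher {
    Contains(ToyContains),
    Prefix(ToyPrefix),
}

impl Matcher {
    /// The enum is matched once per scan, not once per row.
    fn scan(&self, codes: &[u16], offsets: &[usize]) -> Vec<bool> {
        match self {
            Matcher::Contains(dfa) => scan_rows(dfa, codes, offsets),
            Matcher::Prefix(dfa) => scan_rows(dfa, codes, offsets),
        }
    }
}

/// Generic over the concrete DFA, so each instantiation is monomorphised
/// and `matches` can be inlined into the row loop.
fn scan_rows<D: CodeDfa>(dfa: &D, codes: &[u16], offsets: &[usize]) -> Vec<bool> {
    let mut out = Vec::with_capacity(offsets.len().saturating_sub(1));
    let Some((&first, rest)) = offsets.split_first() else {
        return out;
    };
    let mut start = first; // running cursor: each row starts where the last ended
    for &end in rest {
        out.push(dfa.matches(&codes[start..end]));
        start = end;
    }
    out
}
```

The running cursor reads each offset once instead of indexing `offsets[i]` and `offsets[i + 1]` for every row.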

A dict-encoded string column reads back as `Dict(codes, Shared(values))`. `Shared` (which dedups the decoded dictionary across row splits) has no parent-reduce forwarding, so a predicate pushed to the values -- `like(Shared(onpair))` -- canonicalizes (decompresses) the source instead of reaching the OnPair/FSST LIKE kernel. Because the filter path's `values_array_uncanonical` reused the projection's `Shared`-wrapped cache, any query that both projects and filters the same column (e.g. ClickBench Q22's `MIN(URL)` + `WHERE URL LIKE`) silently lost the pushdown.

Give the predicate path its own bare (non-`Shared`) values cache, built on the same underlying read as the `Shared` projection cache (values are read once). Projection keeps `Shared` for cross-split decode reuse; predicates get bare values so the optimizer can push them into the values encoding.

Verified end-to-end on a ClickBench shard (OnPair-encoded `URL`):

- Q22-shape (filter + project URL): kernel firings 0 -> 44, query faster.
- count(*) filter: still 44 firings, result unchanged.
- Q34 (GROUP BY URL, pure decode): unchanged (no decode-cache regression).

Also retarget the OnPair characterization test's comment at this layout fix (the array-level `Shared`-blocks-pushdown behavior it pins is what motivates applying predicates to bare values).

Signed-off-by: Joe Isaacs <[email protected]>
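The two-cache arrangement can be illustrated with a toy model. Nothing below is the actual layout reader: `DictValuesReader`, `ValuesArray`, and `pushdown_fires` are hypothetical, and the "read" is a hard-coded vector. The sketch only shows one underlying read feeding both a `Shared`-wrapped view for projection and a bare view for predicates.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

/// Toy model of the two shapes the dictionary values can take.
enum ValuesArray {
    /// Bare values: a predicate can be pushed into the encoding.
    Bare(Arc<Vec<String>>),
    /// Wrapped for cross-split decode reuse: a predicate would canonicalize.
    Shared(Arc<Vec<String>>),
}

/// Hypothetical reader holding one cached read of the dictionary values.
struct DictValuesReader {
    reads: AtomicUsize, // counts how often the underlying read happened
    raw: OnceLock<Arc<Vec<String>>>,
}

impl DictValuesReader {
    fn new() -> Self {
        Self { reads: AtomicUsize::new(0), raw: OnceLock::new() }
    }

    /// The underlying read happens at most once, whichever path asks first.
    fn read_once(&self) -> Arc<Vec<String>> {
        self.raw
            .get_or_init(|| {
                self.reads.fetch_add(1, Ordering::Relaxed);
                Arc::new(vec!["http://a".to_string(), "http://b".to_string()])
            })
            .clone()
    }

    /// Projection path keeps the shared wrapper.
    fn projection_values(&self) -> ValuesArray {
        ValuesArray::Shared(self.read_once())
    }

    /// Predicate path gets bare values built on the same read.
    fn predicate_values(&self) -> ValuesArray {
        ValuesArray::Bare(self.read_once())
    }
}

/// Stand-in for "does the LIKE kernel get reached for this shape?".
fn pushdown_fires(values: &ValuesArray) -> bool {
    matches!(values, ValuesArray::Bare(_))
}
```

In this model a query that both projects and filters the column calls both accessors, yet the read counter stays at one.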

The per-call DFA table was the dominant cost of the LIKE pushdown on dict-encoded columns (~17% of ClickBench Q21 in a samply profile): it built an `n_states x n_codes` transition for every one of the (up to 4096) dictionary tokens, even though the needle/prefix can only interact with the tokens that contain one of its bytes.

A token whose bytes are all absent from the pattern drives the byte table to the same reset state from every *live* state (a non-needle byte falls back to 0 via KMP from any non-accept state; a non-prefix byte fails), and the accept/fail rows are never read because the scan returns the instant it reaches them. So such a token's whole column is just the skip value.

Pre-fill the table with the skip value and only compute columns for codes containing a pattern byte; for those, read the token once while advancing all `n_states` start states in lockstep (a per-byte gather).

Build-heavy microbench (build + 4k-row scan): ~1.3-1.6x faster, more for rare-byte needles (most tokens skipped), less for common-byte needles like `%google%` on URLs. Randomized ground-truth fuzz test still passes.

Signed-off-by: Joe Isaacs <[email protected]>
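The skip-value build for the contains case can be sketched as below. `build_sparse_contains_table` and `kmp_byte_table` are hypothetical names, the dictionary is modelled as `Vec<Vec<u8>>`, and the needle is assumed to be 1 to 255 bytes long. The skip value for contains is state 0; the accept row of a skipped column is deliberately left at the skip value, relying on the scan returning before it would read that row.

```rust
/// Byte-level KMP automaton: `t[state * 256 + byte]` is the next state.
/// State `needle.len()` is accepting and absorbing. Requires 1 <= len <= 255.
fn kmp_byte_table(needle: &[u8]) -> Vec<u8> {
    let m = needle.len();
    let mut t = vec![0u8; (m + 1) * 256];
    t[needle[0] as usize] = 1;
    let mut x = 0usize; // restart state
    for s in 1..m {
        for b in 0..256 {
            t[s * 256 + b] = t[x * 256 + b];
        }
        t[s * 256 + needle[s] as usize] = (s + 1) as u8;
        x = t[x * 256 + needle[s] as usize] as usize;
    }
    for b in 0..256 {
        t[m * 256 + b] = m as u8;
    }
    t
}

/// Builds `table[state * n_codes + code]`, computing columns only for
/// tokens that contain at least one needle byte.
fn build_sparse_contains_table(needle: &[u8], tokens: &[Vec<u8>]) -> Vec<u8> {
    let n_states = needle.len() + 1;
    let n_codes = tokens.len();
    let bytes = kmp_byte_table(needle);

    let mut in_needle = [false; 256];
    for &b in needle {
        in_needle[b as usize] = true;
    }

    // Pre-fill with the skip value: a token without needle bytes resets
    // every live (non-accept) state to 0.
    let mut table = vec![0u8; n_states * n_codes];
    let mut states = vec![0u8; n_states];
    for (code, token) in tokens.iter().enumerate() {
        if !token.iter().any(|&b| in_needle[b as usize]) {
            continue; // column stays at the skip value
        }
        // Advance all start states in lockstep, reading the token once.
        for (s, slot) in states.iter_mut().enumerate() {
            *slot = s as u8;
        }
        for &b in token {
            for slot in states.iter_mut() {
                *slot = bytes[*slot as usize * 256 + b as usize];
            }
        }
        for (s, &next) in states.iter().enumerate() {
            table[s * n_codes + code] = next;
        }
    }
    table
}
```

On every live row this produces the same entries as lifting each token from each start state individually; only the never-read accept row of skipped columns differs.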

codspeed-hq · 2026-06-11T14:26:10Z

Merging this PR will not alter performance

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 1 improved benchmark
❌ 1 regressed benchmark
✅ 1530 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | `bitwise_not_vortex_buffer_mut[128]` | 215.3 ns | 244.4 ns | -11.93% |
| ⚡ | WallTime | `cuda/bitpacked_u8/unpack/3bw[100M]` | 352.4 µs | 299.7 µs | +17.58% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

Comparing `claude/relaxed-goodall-e3s5pr` (5257888) with `develop` (0dd6db7)

github-actions · 2026-06-12T10:28:48Z

Polar Signals Profiling Results

Latest Run

| Status | Commit | Job | Attempt | Link |
|---|---|---|---|---|
| 🟢 Done | `5257888` | | 1 | Explore Profiling Data |

Powered by Polar Signals Cloud

github-actions · 2026-06-12T10:30:57Z

🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨

Benchmark PolarSignals Profiling failed! Check the workflow run for details.

github-actions · 2026-06-12T10:31:46Z

🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨

Benchmark FineWeb NVMe failed! Check the workflow run for details.

github-actions · 2026-06-12T10:32:11Z

🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨

Benchmark TPC-H SF=1 on NVME failed! Check the workflow run for details.

github-actions · 2026-06-12T10:34:26Z

🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨

Benchmark TPC-DS SF=1 on NVME failed! Check the workflow run for details.

github-actions · 2026-06-12T10:38:17Z

🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨

Benchmark FineWeb S3 failed! Check the workflow run for details.

github-actions · 2026-06-12T10:38:22Z

🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨

Benchmark Statistical and Population Genetics failed! Check the workflow run for details.

github-actions · 2026-06-12T10:41:03Z

🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨

Benchmark TPC-H SF=10 on NVME failed! Check the workflow run for details.

github-actions · 2026-06-12T10:41:58Z

🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨

Benchmark Clickbench on NVME failed! Check the workflow run for details.

github-actions · 2026-06-12T10:44:34Z

BENCHMARK FAILED

Benchmark Random Access failed! Check the workflow run for details.

github-actions · 2026-06-12T10:45:15Z

🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨

Benchmark TPC-H SF=1 on S3 failed! Check the workflow run for details.

github-actions · 2026-06-12T10:47:19Z

🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨

Benchmark TPC-H SF=10 on S3 failed! Check the workflow run for details.

github-actions · 2026-06-12T10:51:09Z

🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨

Benchmark Appian on NVME failed! Check the workflow run for details.

github-actions · 2026-06-28T02:17:04Z

This PR has been marked as stale because it has been open for 14 days with no activity. Please comment or remove the stale label if you wish to keep it active; otherwise it will be closed in 7 days.

claude added 5 commits June 9, 2026 14:25

joseph-isaacs changed the title ~~Claude/relaxed goodall e3s5pr~~ do not merge: onpair dfa Jun 11, 2026

joseph-isaacs added the action/benchmark Trigger full benchmarks to run on this PR label Jun 12, 2026

github-actions Bot removed the action/benchmark Trigger full benchmarks to run on this PR label Jun 12, 2026

github-actions Bot added the stale This PR is stale and will be auto-closed soon label Jun 28, 2026
