bench: gate CodSpeed-unstable canonicalization microbenchmarks from simulation #8519

Draft
connortsui20 wants to merge 1 commit into develop from claude/sharp-planck-i4ifv8

Summary

A small, recurring set of CodSpeed Simulation microbenchmarks report false-positive regressions in nearly every PR, regardless of what the PR changes. This PR gates the worst, provably-unfixable offenders out of CodSpeed simulation (keeping them runnable via local cargo bench), and documents the rest with a keep/remove decision for each.

Note

This is a draft for discussion. The analysis below justifies every benchmark I touched and explains the ones I deliberately left alone. Happy to dial the scope up or down.

How I found them

I read CodSpeed's bot comment on 9 recent PRs spanning unrelated areas (scalar ops, SIMD take, GPU/CUDA, session refactors, stats rules, random-access splitting, etc.):
#8345, #8454, #8470, #8478, #8483, #8489, #8496, #8500, #8505, #8511.

A benchmark that shows a large change in a PR that doesn't touch its code is, by definition, noise. The tell is when the same unchanged code reports wildly different numbers depending only on which commit it was built against. Every one of these comparisons also carried CodSpeed's own ⚠️ "Different runtime environments detected" warning.

The flaky set (each row = identical code, different measured value)

| Benchmark | File | Observed range (unchanged code) | PRs | Verdict |
| --- | --- | --- | --- | --- |
| `decompress_rd[*]` | encodings/alp/benches/alp_compress.rs | f64,(100k) 842 ↔ 1025 µs; (10k) 108 ↔ 139 µs | 7/9 | Gate |
| `chunked_varbinview_*` (×4) | vortex-array/benches/chunk_array_builder.rs | (1000,10) 161 ↔ 198; (100,100) 223 ↔ 308 µs | 6/9 | Gate |
| `chunked_bool_canonical_into` | vortex-array/benches/chunk_array_builder.rs | 16 ↔ 35 µs (~2×) | 4/9 | Gate |
| `take_10k_*`, `patched_take_10k_*` | encodings/fastlanes/benches/bitpacking_take.rs | bimodal 197 ↔ 255 µs | 4/9 | Keep (see below) |
| `varbinview_large` | vortex-array/benches/listview_rebuild.rs | 112 ↔ 131 µs | 3/9 | Keep (this PR) |
| `bitwise_not_vortex_buffer_mut[128]` | vortex-buffer/benches/vortex_bitbuffer.rs | 186 ↔ 244 ns | 2/9 | Keep (this PR) |
| `sum_i32_nullable_all_valid`, `chunked_dict_*`, `encode_varbin*`, `null_count_run_end`, `bench_many_codes_few_values` | various | ±34–95% | 2/9, only vs base 679e2c5 | Keep (lower confidence) |
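
To make the "Observed range" column easier to compare across rows, the spread can be expressed as a ratio or as a percentage of the lower value. The sketch below uses three ranges copied from the table; the function and constant names are illustrative and do not come from the repository.

```rust
/// (label, lowest observed, highest observed) in microseconds,
/// copied from the "Observed range" column above.
const OBSERVED: [(&str, f64, f64); 3] = [
    ("decompress_rd f64 (100k)", 842.0, 1025.0),
    ("chunked_bool_canonical_into", 16.0, 35.0),
    ("varbinview_large", 112.0, 131.0),
];

/// Ratio of the highest to the lowest value reported for unchanged code.
fn spread_ratio(low: f64, high: f64) -> f64 {
    high / low
}

/// The same spread as a percentage of the lower value.
fn spread_pct(low: f64, high: f64) -> f64 {
    (high - low) / low * 100.0
}

/// Formats one line per table row, e.g. for pasting into a PR comment.
fn summarize() -> Vec<String> {
    OBSERVED
        .iter()
        .map(|(name, low, high)| {
            format!(
                "{name}: x{:.2} ({:.1}%)",
                spread_ratio(*low, *high),
                spread_pct(*low, *high)
            )
        })
        .collect()
}
```

On these numbers, `chunked_bool_canonical_into` has a ratio a little above 2, which matches the "~2×" note in its row, while `decompress_rd` moves by roughly a fifth of its lower value.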

Root cause

CodSpeed Simulation estimates CPU cycles from an instruction trace (Cachegrind-style). That trace is deterministic for a fixed binary in a fixed environment, but it includes instructions executed inside glibc — and memcpy/memmove/memset are resolved at runtime via ifunc to a SIMD variant chosen by the runner's CPU/glibc. When the develop baseline and the PR run on different runner images (which CodSpeed explicitly flags here as "different runtime environments"), any benchmark whose hot path is heap allocation + byte copying changes instruction count even though the Vortex code is byte-identical. See CodSpeed's own write-up, "Why glibc is faster on some GitHub Actions Runners".

This predicts exactly which benchmarks flake, and the data confirms it:

  • decompress_rd flakes; compress_rd never does. Decode materializes a fresh canonical output buffer (alloc + tight SIMD/memcpy); encode is compute-bound (sampling, dictionary build). Same file, same data — only the copy-bound one is noisy.
  • chunked_varbinview_* flakes — canonicalizing chunked strings is concatenating variable-length bytes into one buffer (memcpy-dominated).
  • chunked_bool_canonical_into flakes hardest in relative terms — it's also just too small (~16–35 µs → a few hundred layout-sensitive instructions dominate).
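
As a rough analogy for the `ifunc` behaviour described under Root cause, the sketch below picks a copy routine at runtime based on the CPU it runs on. It is not how glibc is implemented, and every name in it is hypothetical. It only illustrates why the caller's code can stay byte-identical while the instructions actually executed differ from machine to machine.

```rust
/// Signature shared by every copy variant.
type CopyFn = fn(&[u8], &mut [u8]);

/// Portable fallback: one byte per loop iteration.
fn copy_bytewise(src: &[u8], dst: &mut [u8]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d = *s;
    }
}

/// Stand-in for a wide SIMD variant; delegates to the standard bulk copy.
/// Panics if the slices differ in length, like `copy_from_slice` itself.
fn copy_bulk(src: &[u8], dst: &mut [u8]) {
    dst.copy_from_slice(src);
}

/// Chooses a variant from the CPU features seen at runtime, loosely
/// mirroring what an ifunc resolver does at symbol-resolution time.
fn resolve_copy() -> (&'static str, CopyFn) {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return ("bulk", copy_bulk as CopyFn);
        }
    }
    ("bytewise", copy_bytewise as CopyFn)
}
```

Both variants produce the same bytes, so correctness tests cannot tell them apart. An instruction-counting profiler can, which is the effect the flaky benchmarks above are exposed to.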

The data movement is the thing being measured, so these cannot be made stable under Simulation by tweaking inputs. docs/developer-guide/benchmarking.md already prescribes the remedy: "Use #[cfg(not(codspeed))] for benchmarks that are incompatible with CodSpeed."
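
A minimal sketch of what the `#[cfg(not(codspeed))]` gate does, assuming the build passes `--cfg codspeed` under CodSpeed (as the Verification section's symbol check suggests). Plain functions stand in for the real benchmark registrations; the function bodies and the `registered_benches` helper are hypothetical and not taken from the repository.

```rust
/// Copy-bound benchmark: compiled out when the crate is built with
/// `--cfg codspeed`, still present for a plain local `cargo bench`.
#[cfg(not(codspeed))]
fn decompress_rd_bench() -> &'static str {
    "decompress_rd"
}

/// Compute-bound benchmark: always compiled.
fn compress_rd_bench() -> &'static str {
    "compress_rd"
}

/// Lists the benchmarks that exist in this particular build.
fn registered_benches() -> Vec<&'static str> {
    // `mut` is unused when the gated push below is compiled out.
    #[allow(unused_mut)]
    let mut names = vec![compress_rd_bench()];
    // The call site must carry the same gate as the function it calls,
    // otherwise the CodSpeed build fails with an unresolved name.
    #[cfg(not(codspeed))]
    names.push(decompress_rd_bench());
    names
}
```

Built without the flag, `registered_benches()` returns both names; with `--cfg codspeed` only `compress_rd` remains. Recent toolchains may warn about the unknown `codspeed` cfg unless the crate declares it, which is a warning rather than an error.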

Changes (every gated benchmark justified)

All gated benches stay fully available via local cargo bench — only CodSpeed CI stops tracking them.

encodings/alp/benches/alp_compress.rs

  • decompress_rd → gated. The #1 offender: moved in 7/9 sampled PRs, bidirectionally, 842↔1025 µs for identical code. It decodes to a canonical array, so its instruction count is dominated by output-buffer alloc + memcpy rather than the ALP-RD decode. compress_rd (encode, compute-bound, never flagged) is kept.

vortex-array/benches/chunk_array_builder.rs

  • chunked_varbinview_canonical_into / _into_canonical / _opt_canonical_into / _opt_into_canonical → gated. All four flake across 6/9 PRs; all are memcpy-bound string canonicalization.
  • chunked_bool_canonical_into → gated. Worst relative swings (~2×, 16↔35 µs) and below the Simulation noise floor.
  • Kept: chunked_opt_bool_canonical_into, chunked_opt_bool_into_canonical, chunked_constant_i32_…, chunked_constant_utf8_… — compute-bound, never flagged.

Deliberately kept (flaky ≠ delete)

  • bitpacking_take.rs take_10k_* / patched_take_10k_* — bimodal (±23%), but the in-file comment shows the author tuned the sparse/bitpacked-take thresholds using these exact cases. They measure the core random-access path (real Vortex compute, not glibc), so they're load-bearing. Removing them would lose intended coverage; better to live with the noise floor.
  • varbinview_large, chunked_dict_*, encode_varbin*, bench_many_codes_few_values — same memcpy-bound root cause, but several were only observed against a single base (679e2c5), so I left them for a focused follow-up rather than over-reaching here.
  • bitwise_not_vortex_buffer_mut[128] — sub-µs noise, but its INPUT_SIZE is shared by ~25 valid vortex-vs-arrow comparison benches; not worth restructuring for a ±30 ns wobble.
  • CUDA walltime benches (e.g. cuda/bitpacked_u8/unpack/3bw[100M], swings bidirectionally) — these use the WallTime instrument on non-macro GPU runners (you can't simulate a GPU); that's a separate, intentional trade-off and out of scope.

Verification

Run locally with the pinned nightly + cargo-codspeed, valgrind present:

  • cargo codspeed build (Simulation) succeeds for both vortex-alp and vortex-array (--features _test-harness) — no warnings, no orphaned imports/dead code.
  • cargo codspeed run (Measurement mode: Simulation) executes both suites to completion (exit 0). The gated benches are absent at runtime; compress_rd, chunked_opt_bool_*, and chunked_constant_* still measure.
  • ✅ Binary symbol check confirms the gated benches are compiled out under --cfg codspeed and present without it.
  • ✅ Local cargo bench build path, cargo +nightly fmt --check, and cargo clippy --no-deps on both bench targets are clean. (Full-workspace clippy hits an unrelated pre-existing lint in vortex-buffer under this environment's toolchain; not touched here.)

Follow-ups for maintainers

  1. Archive the gated benches on CodSpeed so they drop off the dashboard instead of showing as "skipped".
  2. The broader class (canonicalization/builder/decode-to-canonical) shares this root cause. The durable fix is infrastructure-level: ensure the develop baseline and PR runs use the same pinned runner image (so "different runtime environments" stops firing), or move these to CodSpeed walltime/macro runners. If that lands, several "kept" benches above could return to Simulation.

Generated by Claude Code

A small set of microbenchmarks report false-positive regressions in nearly
every PR. Their CodSpeed CPU-simulation instruction count is dominated by
output-buffer allocation and glibc `memcpy`/`memmove` (whose `ifunc`-selected
implementation varies across runner images) rather than by Vortex compute, so
they move bidirectionally by 10-90% for unchanged code and CodSpeed flags
"different runtime environments" on the comparisons. They cannot be stabilized
under simulation, so per `docs/developer-guide/benchmarking.md` they are gated
with `#[cfg(not(codspeed))]` and remain available via local `cargo bench`.

Gated from CodSpeed (kept for local runs):
- alp_compress.rs: `decompress_rd` (decode-to-canonical; moved in 7/9 sampled
  PRs, 842-1025 us for identical code). `compress_rd` (encode, compute-bound,
  never flaky) is kept.
- chunk_array_builder.rs: `chunked_varbinview_*` (string canonicalization,
  memcpy-bound; flaky in 6/9 PRs) and `chunked_bool_canonical_into` (~2x
  swings; at ~16-35 us it is also too small to measure stably). The
  compute-bound `chunked_opt_bool_*` and `chunked_constant_*` benches are kept.

Verified: both suites build and run under `cargo codspeed` (Simulation mode),
the gated benches are excluded while the kept benches still execute, and the
local `cargo bench` path, `cargo fmt`, and `cargo clippy` are clean.

Signed-off-by: Claude <[email protected]>
Claude-Session: https://claude.ai/code/session_01GXdjWYp7AbSKwn2bw6GYsf

codspeed-hq Bot commented Jun 20, 2026


Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

⚡ 5 improved benchmarks
❌ 4 regressed benchmarks
✅ 1545 untouched benchmarks
⏩ 27 skipped benchmarks (see footnote 1)

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| Simulation | `take_10k_random` | 195.2 µs | 253 µs | -22.85% |
| Simulation | `take_10k_contiguous` | 215.8 µs | 273.6 µs | -21.11% |
| Simulation | `patched_take_10k_contiguous_patches` | 229.2 µs | 287.9 µs | -20.39% |
| Simulation | `patched_take_10k_random` | 241.8 µs | 300.6 µs | -19.55% |
| Simulation | `baseline_eq[4, 65536]` | 243.6 µs | 185.8 µs | +31.09% |
| Simulation | `baseline_lt[4, 65536]` | 259 µs | 201.2 µs | +28.77% |
| Simulation | `baseline_eq[16, 65536]` | 288.5 µs | 230.8 µs | +25.02% |
| Simulation | `baseline_lt[16, 65536]` | 303.7 µs | 245.9 µs | +23.52% |
| Simulation | `varbinview_large` | 130.9 µs | 112.3 µs | +16.62% |
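
The Efficiency column in the rows above appears consistent with `(BASE - HEAD) / HEAD`, to within the rounding of the displayed values. That formula is inferred from the numbers in this report, not taken from CodSpeed documentation, and the function names below are illustrative.

```rust
/// Relative change as the rows above seem to compute it:
/// positive when HEAD is faster than BASE. Inferred, not documented.
fn efficiency_pct(base_us: f64, head_us: f64) -> f64 {
    (base_us - head_us) / head_us * 100.0
}

/// Absolute shift between the two runs, in microseconds.
fn abs_delta_us(base_us: f64, head_us: f64) -> f64 {
    head_us - base_us
}

/// True when a computed value matches a reported one within `tol` points.
fn matches_reported(computed: f64, reported: f64, tol: f64) -> bool {
    (computed - reported).abs() <= tol
}
```

For example, `efficiency_pct(195.2, 253.0)` is about -22.85 and `efficiency_pct(243.6, 185.8)` is about +31.1. The absolute deltas are also worth a look: the four `take_10k` rows and the four `baseline_*` rows each shift by roughly 58 µs, in opposite directions, which would fit a fixed per-run cost difference better than a proportional change in the measured code. That reading is an observation from the table, not something the report states.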

Comparing claude/sharp-planck-i4ifv8 (98e41d4) with develop (de60638)

Footnotes

  1. 27 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them on CodSpeed to remove them from the performance reports.
