Conversation

@bravo1goingdark
Owner

Before

  • Every semanticize call reloaded the tokenizer and ONNX session, and semanticize_batch simply looped over documents, so batches still ran N single-document inferences.
  • Missing ONNX/tokenizer assets caused hard errors, even though the docs promised a deterministic stub fallback.

After

  • Introduced a per-thread cache that stores tokenizer + ONNX session handles keyed by model path (see the sketch after this list).
  • ONNX mode now performs a single batched inference that pads inputs and reuses the cached session.
  • Errors resolving models/tokenizers fall back to the deterministic stub, matching the documented behavior, and docs/tests were updated accordingly.
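As a rough illustration of the caching and fallback described above, here is a minimal Rust sketch. The types Tokenizer, OnnxSession, and ModelHandles, and the functions load_handles, batched_inference, and stub_embedding are hypothetical placeholders, not this crate's actual API; only the thread-local, path-keyed cache pattern and the deterministic stub fallback mirror what the PR describes.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::rc::Rc;

// Hypothetical stand-ins for the real tokenizer and ONNX session types.
#[allow(dead_code)]
struct Tokenizer;
#[allow(dead_code)]
struct OnnxSession;

#[allow(dead_code)]
struct ModelHandles {
    tokenizer: Tokenizer,
    session: OnnxSession,
}

thread_local! {
    // Per-thread cache keyed by model path: each worker thread loads the
    // tokenizer + session once and reuses the handles on later calls.
    static MODEL_CACHE: RefCell<HashMap<PathBuf, Rc<ModelHandles>>> =
        RefCell::new(HashMap::new());
}

// Placeholder loader: the real code would resolve the ONNX model and
// tokenizer files; returning None models the "assets missing" case.
fn load_handles(_model_path: &Path) -> Option<ModelHandles> {
    None
}

// Deterministic stub embedding used when assets cannot be resolved.
fn stub_embedding(text: &str, dims: usize) -> Vec<f32> {
    let bytes = text.as_bytes();
    (0..dims)
        .map(|i| {
            let b = if bytes.is_empty() { 0 } else { bytes[i % bytes.len()] };
            f32::from(b) / 255.0
        })
        .collect()
}

// Placeholder for the single batched inference: the real version pads the
// tokenized inputs to a common length and runs one session call.
fn batched_inference(_handles: &ModelHandles, texts: &[&str], dims: usize) -> Vec<Vec<f32>> {
    texts.iter().map(|t| stub_embedding(t, dims)).collect()
}

fn semanticize_batch(model_path: &Path, texts: &[&str], dims: usize) -> Vec<Vec<f32>> {
    MODEL_CACHE.with(|cache| {
        // Look up the handles for this model path, loading them at most
        // once per thread.
        let handles = {
            let mut cache = cache.borrow_mut();
            if let Some(h) = cache.get(model_path) {
                Some(Rc::clone(h))
            } else if let Some(h) = load_handles(model_path) {
                let h = Rc::new(h);
                cache.insert(model_path.to_path_buf(), Rc::clone(&h));
                Some(h)
            } else {
                None
            }
        };

        match handles {
            // Cached tokenizer + session: run one batched inference.
            Some(h) => batched_inference(&h, texts, dims),
            // Assets missing: fall back to the deterministic stub so the
            // pipeline keeps running.
            None => texts.iter().map(|t| stub_embedding(t, dims)).collect(),
        }
    })
}
```

Rc is sufficient here because the cache is thread-local; a cache shared across threads would instead need Arc plus a lock or a concurrent map.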

Advantages

  • Steady-state semantic inference no longer reloads models on every call, sharply reducing latency and filesystem churn.
  • Batching now amortizes session setup costs, improving throughput for multi-document workloads.
  • Pipelines keep running even when assets are missing or temporarily unavailable.

Testing

  • cargo fmt --all
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test --workspace --all-targets -- --nocapture

@bravo1goingdark bravo1goingdark self-assigned this Nov 19, 2025
@bravo1goingdark bravo1goingdark added the area:semantic (Semantic embeddings, ONNX models, or API inference) and type:performance (Profiling, batching, or throughput improvements) labels Nov 19, 2025
@bravo1goingdark bravo1goingdark linked an issue (Cache ONNX tokenizer/session handles) Nov 19, 2025 that may be closed by this pull request
@bravo1goingdark bravo1goingdark merged commit 7e862aa into main Nov 19, 2025
1 check passed