Conversation

@snkas
Contributor

@snkas snkas commented Jan 23, 2026

Tracks the number of hits and misses for each Bloom filter. These counts are included in the statistics that the spine can retrieve. The spine sums the Bloom filter statistics it receives from its batches, so insight into individual batch statistics is lost. The spine reports three new metadata fields:

  • "Bloom filter hits": usize integer
  • "Bloom filter misses": usize integer
  • "Bloom filter hit rate": percentage
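
A minimal sketch of the summation described above, assuming a hypothetical `FilterStats` struct and `aggregate` helper (neither name is taken from the PR's code). It illustrates why summing hides per-batch detail: one batch at 90% and one at 0% collapse into a single 45% figure.

```rust
/// Hypothetical per-batch statistics; the names are illustrative only.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct FilterStats {
    hits: usize,
    misses: usize,
}

impl FilterStats {
    /// Hit rate as a percentage, or `None` if the filter was never queried.
    fn hit_rate_percent(&self) -> Option<f64> {
        let total = self.hits + self.misses;
        if total == 0 {
            None
        } else {
            Some(100.0 * self.hits as f64 / total as f64)
        }
    }
}

/// Sums the statistics of all batches into one value (assumed spine behaviour).
fn aggregate(batches: &[FilterStats]) -> FilterStats {
    batches.iter().fold(FilterStats::default(), |acc, s| FilterStats {
        hits: acc.hits + s.hits,
        misses: acc.misses + s.misses,
    })
}

fn main() {
    let batches = [
        FilterStats { hits: 90, misses: 10 }, // 90% on its own
        FilterStats { hits: 0, misses: 100 }, // 0% on its own
    ];
    let total = aggregate(&batches);
    println!(
        "hits={} misses={} hit rate={:?}",
        total.hits,
        total.misses,
        total.hit_rate_percent()
    );
}
```

Returning `None` for an unqueried filter is a choice made for this sketch; it avoids a division by zero without reporting a misleading 0%.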

The Web Console profiler is updated to display the new metadata fields.

PR information:

  • Documentation: not updated
  • Changelog: not updated
  • Backward incompatible changes: only new metadata fields are added, and the Web Console profiler has been updated to display them. I've checked that it is still able to load profiles from before the new metadata fields were added.
  • Fixes: [dbsp] Report Bloom filter hit rate #5439

@snkas
Contributor Author

snkas commented Jan 23, 2026

Using the following SQL:

CREATE TABLE t1 (
    id BIGINT NOT NULL PRIMARY KEY
) WITH (
    'connectors' = '[{
        "name": "gen1",
        "transport": {
            "name": "datagen",
            "config": {
                "plan": [{
                    "limit": 100000000000, 
                    "rate": 1000000,
                    "fields": {
                        "id": { "strategy": "uniform", "range": [0, 10000] }
                    }
                }]
            } 
        }
    }]'
);

Varying the runtime configuration min_storage_bytes gives the following Bloom filter hit rates:

  • min_storage_bytes=0: some ~100%, some ~65%, some ~20%, some ~0%
  • min_storage_bytes=1000: some ~100%, some ~63%, some 7%, some 0%
  • min_storage_bytes=10000: some ~100%, several ~60%, one 25%

(These values vary per run.)
The most interesting and consistent case to debug or investigate is min_storage_bytes=10000.

One possible alteration to the program is to increase the tuple size by adding:

# Add column:
data VARCHAR NOT NULL

# Add to datagen:
"data": { "strategy": "paragraphs", "range": [0, 100] }

@blp
Member

blp commented Jan 23, 2026

I was picturing this being reported as a metric on spines, in AsyncMerger::metadata, rather than through the log. If it's reported in the per-spine metadata then we'll get it in profiles "for free" without having to look through the log.

@blp
Member

blp commented Jan 23, 2026

I think I'd use a pair of AtomicU64s instead of Mutexes. They should be cheaper.
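
For illustration, a rough sketch of what a pair of atomic counters could look like. `HitMissCounters` and `count_concurrently` are hypothetical names, not code from this PR, and `Ordering::Relaxed` is an assumption that the counters are pure statistics and do not synchronize any other data.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical counter pair; names are illustrative, not from the PR.
#[derive(Default)]
struct HitMissCounters {
    hits: AtomicU64,
    misses: AtomicU64,
}

impl HitMissCounters {
    /// Records one lookup outcome without taking a lock.
    fn record(&self, hit: bool) {
        let counter = if hit { &self.hits } else { &self.misses };
        counter.fetch_add(1, Ordering::Relaxed);
    }

    /// Returns `(hits, misses)`. The two loads are independent, so under
    /// concurrent updates the pair is not a single consistent snapshot.
    fn snapshot(&self) -> (u64, u64) {
        (
            self.hits.load(Ordering::Relaxed),
            self.misses.load(Ordering::Relaxed),
        )
    }
}

/// Spawns `threads` workers that each record `lookups_per_thread` outcomes.
fn count_concurrently(threads: u64, lookups_per_thread: u64) -> (u64, u64) {
    let counters = Arc::new(HitMissCounters::default());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counters = Arc::clone(&counters);
            thread::spawn(move || {
                for i in 0..lookups_per_thread {
                    // Every fourth lookup is recorded as a miss.
                    counters.record(i % 4 != 0);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    counters.snapshot()
}

fn main() {
    let (hits, misses) = count_concurrently(4, 1000);
    println!("hits={} misses={}", hits, misses);
}
```

Each `fetch_add` is a single atomic read-modify-write, so no increment is lost and no thread blocks waiting for a lock. The trade-off is that hits and misses are read separately and may be momentarily out of step with each other.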

@snkas snkas force-pushed the bloom-filter-hit-rate branch from f28c4d6 to 14c154a Compare January 28, 2026 17:18
@snkas snkas marked this pull request as ready for review January 28, 2026 17:26
Copilot AI review requested due to automatic review settings January 28, 2026 17:26
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces tracking capabilities for Bloom filters used in DBSP batch operations, capturing hit/miss statistics for performance analysis. The changes enable monitoring of Bloom filter effectiveness across the storage layer and make these metrics visible in the Web Console profiler.

Changes:

  • Adds TrackingBloomFilter wrapper around BloomFilter to count hits and misses using atomic counters
  • Refactors filter_size() method to filter_stats() throughout the codebase to return comprehensive statistics
  • Updates the profiler to display three new metadata fields: hits, misses, and hit rate

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.

File Description
js-packages/profiler-lib/src/profile.ts Adds new Bloom filter metrics to measurement categories
crates/dbsp/src/storage/tracking_bloom_filter.rs Implements new TrackingBloomFilter with hit/miss tracking
crates/dbsp/src/storage/file/writer.rs Updates writer to use TrackingBloomFilter
crates/dbsp/src/storage/file/reader.rs Updates reader to use TrackingBloomFilter and expose stats
crates/dbsp/src/storage/file/format.rs Updates format conversions for TrackingBloomFilter
crates/dbsp/src/trace.rs Changes BatchReader trait method from filter_size() to filter_stats()
crates/dbsp/src/trace/spine_async.rs Aggregates filter stats across batches and adds new metadata fields
crates/dbsp/src/trace/spine_async/snapshot.rs Updates snapshot to use filter_stats()
crates/dbsp/src/trace/test/test_batch.rs Updates test batch to return default stats
crates/dbsp/src/trace/ord/vec/*.rs Updates vec-based batches to return default stats
crates/dbsp/src/trace/ord/file/*.rs Updates file-based batches to delegate to file reader
crates/dbsp/src/trace/ord/fallback/*.rs Updates fallback batches to delegate to inner implementation
crates/dbsp/src/circuit/metadata.rs Adds Float variant to MetaItem enum
crates/dbsp/src/storage.rs Exports new tracking_bloom_filter module
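
The delegation pattern in the file summary above can be sketched roughly as follows. All types here are simplified stand-ins: the real `BatchReader` trait has many more methods, and the fields of `FilterStats` (including `size_bytes`) are assumptions made for this example.

```rust
/// Hypothetical statistics type; the real fields may differ.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct FilterStats {
    size_bytes: usize,
    hits: usize,
    misses: usize,
}

/// Simplified stand-in for the batch reader trait.
trait BatchReader {
    fn filter_stats(&self) -> FilterStats;
}

/// In-memory batch: no Bloom filter, so it reports default (all-zero) stats.
struct VecBatch;

impl BatchReader for VecBatch {
    fn filter_stats(&self) -> FilterStats {
        FilterStats::default()
    }
}

/// File-backed batch: forwards whatever its file reader has recorded.
struct FileBatch {
    reader_stats: FilterStats,
}

impl BatchReader for FileBatch {
    fn filter_stats(&self) -> FilterStats {
        self.reader_stats
    }
}

/// Fallback batch: delegates to whichever inner implementation it holds.
enum FallbackBatch {
    Vec(VecBatch),
    File(FileBatch),
}

impl BatchReader for FallbackBatch {
    fn filter_stats(&self) -> FilterStats {
        match self {
            FallbackBatch::Vec(inner) => inner.filter_stats(),
            FallbackBatch::File(inner) => inner.filter_stats(),
        }
    }
}

fn main() {
    let on_disk = FallbackBatch::File(FileBatch {
        reader_stats: FilterStats { size_bytes: 1024, hits: 7, misses: 3 },
    });
    let in_memory = FallbackBatch::Vec(VecBatch);
    println!("{:?}", on_disk.filter_stats());
    println!("{:?}", in_memory.filter_stats());
}
```

Returning one struct from a single method means that adding another statistic later only changes the struct, not every implementor's signature.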

@snkas
Contributor Author

snkas commented Jan 28, 2026

Screenshot of profiler:

bloom_filter

@snkas
Contributor Author

snkas commented Jan 28, 2026

Ready for review!

"bounds",
"Bloom filter size",
"Bloom filter bits/key"]);
"Bloom filter bits/key",
Contributor

these will need to change once we merge #5514, but let's see which PR lands first

Contributor Author

Let's see. This one is currently queued, so if it (hopefully) passes CI, the other PR will land after it.

@snkas snkas force-pushed the bloom-filter-hit-rate branch from 14c154a to 9e31ddf Compare January 28, 2026 18:14
"Bloom filter bits/key",
"Bloom filter hits",
"Bloom filter misses",
"Bloom filter hit rate"]);
Contributor

hit rate should move below, at the PercentValue case.

Contributor Author

Thanks! I think you meant that line 489 needs to be moved upward; line 380 is the category mapping.

@snkas snkas force-pushed the bloom-filter-hit-rate branch from 9e31ddf to ba8cca4 Compare January 28, 2026 18:17
Member

@blp blp left a comment

Well done, thank you

Tracks for each Bloom filter the number of hits and misses. They are
included in the statistics that can be retrieved by the spine. The spine
sums up over all the Bloom filter statistics it receives from its
batches, which loses insight into individual batch statistics. The spine
reports three new metadata fields:

- "Bloom filter hits": usize integer
- "Bloom filter misses": usize integer
- "Bloom filter hit rate": percentage

The Web Console profiler is updated to display the new metadata fields.

Signed-off-by: Simon Kassing <[email protected]>
@snkas snkas force-pushed the bloom-filter-hit-rate branch from ba8cca4 to 13d975d Compare January 28, 2026 18:32
@snkas
Contributor Author

snkas commented Jan 28, 2026

Updated screenshot of profiler:

bloom_filter2

@snkas snkas added this pull request to the merge queue Jan 28, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 28, 2026
@snkas snkas added this pull request to the merge queue Jan 29, 2026
Merged via the queue into main with commit 247b89e Jan 29, 2026
1 check passed
@snkas snkas deleted the bloom-filter-hit-rate branch January 29, 2026 12:59

Development

Successfully merging this pull request may close these issues.

[dbsp] Report Bloom filter hit rate

4 participants