@thiagohora thiagohora commented Feb 3, 2026

Details

This PR optimizes the SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT query in DatasetItemVersionDAO.java by adding project_id filtering to feedback_scores tables to leverage ClickHouse's index structure (workspace_id, project_id, ...).

Key Optimizations

  1. Pre-computed experiment_items_trace_scope CTE - Pre-computes DISTINCT trace_ids for efficient filtering
  2. Pre-computed target_projects CTE - Pre-computes DISTINCT project_ids from traces
  3. Filtered Feedback Scores Queries - Added project_id IN (SELECT project_id FROM target_projects) filtering

Before vs After

Before:

FROM feedback_scores FINAL
WHERE entity_type = 'trace'
  AND workspace_id = :workspace_id
  AND entity_id IN (SELECT trace_id FROM experiment_items_scope)

After:

FROM feedback_scores FINAL
WHERE entity_type = 'trace'
  AND workspace_id = :workspace_id
  AND project_id IN (SELECT project_id FROM target_projects)
  AND entity_id IN (SELECT trace_id FROM experiment_items_trace_scope)
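
As a rough illustration of why the extra filter should not change results, the sketch below runs both filter shapes against an in-memory SQLite database and compares the rows. Everything here is an assumption made for the example: SQLite stands in for ClickHouse (so FINAL and the MergeTree index are absent), the table layouts and sample rows are invented, and the CTE bodies are guesses based on the descriptions above, not the actual query in DatasetItemVersionDAO.java.

```python
import sqlite3

def build_db() -> sqlite3.Connection:
    """Create a tiny in-memory dataset (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE experiment_items (workspace_id TEXT, trace_id TEXT);
        CREATE TABLE traces (workspace_id TEXT, project_id TEXT, id TEXT);
        CREATE TABLE feedback_scores (
            workspace_id TEXT, project_id TEXT, entity_type TEXT,
            entity_id TEXT, name TEXT, value REAL
        );
        INSERT INTO experiment_items VALUES ('w1', 't1'), ('w1', 't2'), ('w1', 't2');
        INSERT INTO traces VALUES
            ('w1', 'p1', 't1'), ('w1', 'p1', 't2'), ('w1', 'p2', 't3');
        INSERT INTO feedback_scores VALUES
            ('w1', 'p1', 'trace', 't1', 'accuracy', 0.9),
            ('w1', 'p1', 'trace', 't2', 'accuracy', 0.7),
            ('w1', 'p2', 'trace', 't3', 'accuracy', 0.1),
            ('w1', 'p1', 'span',  't1', 'accuracy', 0.5);
    """)
    return conn

# "Before" shape: filter by workspace and trace scope only.
BEFORE = """
WITH experiment_items_scope AS (
    SELECT trace_id FROM experiment_items WHERE workspace_id = :workspace_id
)
SELECT entity_id, name, value
FROM feedback_scores
WHERE entity_type = 'trace'
  AND workspace_id = :workspace_id
  AND entity_id IN (SELECT trace_id FROM experiment_items_scope)
ORDER BY entity_id, name
"""

# "After" shape: additionally restrict to the projects that own those traces.
AFTER = """
WITH experiment_items_trace_scope AS (
    SELECT DISTINCT trace_id FROM experiment_items WHERE workspace_id = :workspace_id
),
target_projects AS (
    SELECT DISTINCT project_id FROM traces
    WHERE workspace_id = :workspace_id
      AND id IN (SELECT trace_id FROM experiment_items_trace_scope)
)
SELECT entity_id, name, value
FROM feedback_scores
WHERE entity_type = 'trace'
  AND workspace_id = :workspace_id
  AND project_id IN (SELECT project_id FROM target_projects)
  AND entity_id IN (SELECT trace_id FROM experiment_items_trace_scope)
ORDER BY entity_id, name
"""

def run(conn: sqlite3.Connection, sql: str, workspace_id: str) -> list:
    """Execute a query with the named :workspace_id parameter bound."""
    return conn.execute(sql, {"workspace_id": workspace_id}).fetchall()

conn = build_db()
before_rows = run(conn, BEFORE, "w1")
after_rows = run(conn, AFTER, "w1")
```

In this toy dataset both queries return the same two trace-level scores. The equivalence depends on each feedback score carrying the same project_id as the trace it belongs to; if that did not hold, the added filter would drop rows.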

When This Optimization Helps

The optimization is defensive: it does not change query results, and it is expected to reduce the work ClickHouse does when:

  • Workspaces have many projects, and experiment traces belong to a small subset of them
  • The target project contains a small fraction of total feedback_scores
  • Tables are large enough that filtering on project_id lets ClickHouse prune data through the index structure (workspace_id, project_id, ...)

Benchmark Results

Remote Server (1M records)

| Metric | Optimized | Old | Notes |
|---|---|---|---|
| Duration | ~1,823 ms | ~1,840 ms | Similar (see analysis below) |
| Rows Read | 22,451,192 | 22,451,192 | Same |
| Data Read | 4.00 GiB | 4.00 GiB | Same |

Note: The benchmark shows minimal improvement because the test dataset's experiment traces belong to the largest project (67.78% of all feedback_scores data). The optimization will show significant gains when traces belong to smaller projects.

Local Server (~2K records)

| Metric | Optimized | Old | Difference |
|---|---|---|---|
| Duration | ~15 ms | ~14 ms | Similar |
| Rows Read | 2,165 | 2,165 | Same |
| Data Read | 393 KiB | 393 KiB | Same |

EXPLAIN Index Analysis

Remote Server Index Usage (Both Queries - Identical)

| Table | Keys | Parts | Granules |
|---|---|---|---|
| experiment_items | workspace_id | 6/8 | 831/1,041 |
| experiments | workspace_id, dataset_id, id | 1/1 | 1/1 |
| dataset_item_versions | workspace_id, dataset_id | 1/6 | 128/11,348 |
| traces | workspace_id, id | 4/7 | 149/4,721 |

EXPLAIN indexes=1 Output (Remote)

ReadFromMergeTree (opik.experiment_items)
Indexes:
  PrimaryKey
    Keys: workspace_id
    Condition: (workspace_id in ['7596558c-...'])
    Parts: 6/8
    Granules: 831/1041

ReadFromMergeTree (opik.experiments)
Indexes:
  PrimaryKey
    Keys: workspace_id, dataset_id, id
    Condition: and((id in 1-element set), and((dataset_id in [...]), (workspace_id in [...])))
    Parts: 1/1
    Granules: 1/1

ReadFromMergeTree (opik.dataset_item_versions)
Indexes:
  PrimaryKey
    Keys: workspace_id, dataset_id
    Condition: and((dataset_id in [...]), (workspace_id in [...]))
    Parts: 1/6
    Granules: 128/11348

ReadFromMergeTree (opik.traces)
Indexes:
  PrimaryKey
    Keys: workspace_id, id
    Condition: and((id in 1000000-element set), (workspace_id in [...]))
    Parts: 4/7
    Granules: 149/4721
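
To read the EXPLAIN output at a glance, it can help to turn each `Granules: selected/total` line into a fraction of the table that is actually read. The sketch below is a minimal helper written for this purpose; the function name and the trimmed input string are assumptions made for illustration, and the parser only handles the two line shapes shown in the output above.

```python
import re

# Trimmed copy of the remote EXPLAIN indexes=1 output above (Parts/Granules lines only).
EXPLAIN_OUTPUT = """\
ReadFromMergeTree (opik.experiment_items)
    Parts: 6/8
    Granules: 831/1041
ReadFromMergeTree (opik.experiments)
    Parts: 1/1
    Granules: 1/1
ReadFromMergeTree (opik.dataset_item_versions)
    Parts: 1/6
    Granules: 128/11348
ReadFromMergeTree (opik.traces)
    Parts: 4/7
    Granules: 149/4721
"""

def granule_selectivity(explain_text: str) -> dict:
    """Map each table to the fraction of granules the primary key selects."""
    result = {}
    table = None
    for line in explain_text.splitlines():
        header = re.match(r"\s*ReadFromMergeTree \((\S+)\)", line)
        if header:
            table = header.group(1)
            continue
        granules = re.match(r"\s*Granules:\s*(\d+)/(\d+)", line)
        if granules and table is not None:
            selected, total = int(granules.group(1)), int(granules.group(2))
            result[table] = selected / total
    return result

selectivity = granule_selectivity(EXPLAIN_OUTPUT)
for name, fraction in selectivity.items():
    print(f"{name}: {fraction:.1%} of granules read")
```

Read this way, experiment_items is filtered on workspace_id alone and still reads roughly 80% of its granules, while dataset_item_versions and traces are narrowed to a few percent.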

Local Server Index Usage (Both Queries - Identical)

| Table | Keys | Parts | Granules |
|---|---|---|---|
| experiment_items | workspace_id | 1/1 | 1/1 |
| experiments | workspace_id, dataset_id, id | 1/2 | 1/2 |
| dataset_item_versions | workspace_id, dataset_id | 1/2 | 1/2 |
| traces | workspace_id, id | 1/2 | 1/2 |

EXPLAIN PIPELINE Analysis

Both queries produce identical execution pipelines with 32 parallel workers (remote) / 12 parallel workers (local):

(Expression)
ExpressionTransform × 32
  (Aggregating)
  Resize 32 → 32
    AggregatingTransform × 32
      (Expression)
      ExpressionTransform × 32
        (Join)
        JoiningTransform × 32 2 → 1
          (Expression)
          ExpressionTransform × 32
            (Join)
            JoiningTransform × 32 2 → 1
              (Expression)
              ExpressionTransform
                (Filter)
                FilterTransform
                  (LimitBy)
                  LimitByTransform
                    (Sorting)
                    MergingSortedTransform 32 → 1
                      MergeSortingTransform × 32
                        (ReadFromMergeTree)
                        MergeTreeSelect(pool: ReadPool, algorithm: Thread) × 32

Key pipeline stages:

  • MergeTreeSelect: Parallel reads from ClickHouse tables
  • JoiningTransform: Hash joins between CTEs
  • LimitByTransform: Deduplication via LIMIT 1 BY
  • AggregatingTransform: COUNT(DISTINCT) aggregation
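
For readers less familiar with `LIMIT 1 BY`, the deduplication stage can be pictured as "sort the rows, then keep the first row seen for each key". The sketch below mimics that behaviour in plain Python. The field names (`id`, `last_updated_at`) and the descending sort are assumptions chosen for the example; the actual ORDER BY and BY columns of the query are not shown in this description.

```python
def limit_1_by(rows: list, by: str, order_by: str, descending: bool = True) -> list:
    """Keep the first row per `by` key after sorting on `order_by`."""
    ordered = sorted(rows, key=lambda row: row[order_by], reverse=descending)
    seen = set()
    deduplicated = []
    for row in ordered:
        key = row[by]
        if key not in seen:
            # First row for this key in sort order wins; later ones are dropped.
            seen.add(key)
            deduplicated.append(row)
    return deduplicated

# Hypothetical rows: two versions of item "a", one version of item "b".
rows = [
    {"id": "a", "last_updated_at": 1},
    {"id": "a", "last_updated_at": 3},
    {"id": "b", "last_updated_at": 2},
]
latest = limit_1_by(rows, by="id", order_by="last_updated_at")
```

With a descending sort on the timestamp, only the most recent version of each item survives, which is the usual reason to pair a sort with `LIMIT 1 BY`.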

Data Distribution Analysis

Feedback Scores per Project (Remote)

| project_id | row_count | percentage |
|---|---|---|
| 019aef68-... (experiment's project) | 20,252,668 | 67.78% |
| 019b9c7d-... | 9,609,498 | 32.16% |
| Others | ~19,804 | 0.06% |

Key insight: All 1,000,000 experiment traces belong to the largest project (67.78% of data), so project_id filtering can only skip 32% of rows.
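
The 32% figure follows directly from the distribution above. As a back-of-the-envelope check, the sketch below computes the upper bound on the share of feedback_scores rows a project_id filter could skip for a given set of target projects. The function name is hypothetical, the project IDs are the truncated ones from the table, and the "others" count is the approximate value reported there.

```python
# Row counts per project, copied from the distribution above
# (IDs are truncated in the source; "others" is approximate).
ROWS_BY_PROJECT = {
    "019aef68-...": 20_252_668,  # the experiment's project
    "019b9c7d-...": 9_609_498,
    "others": 19_804,
}

def skippable_fraction(rows_by_project: dict, target_projects: set) -> float:
    """Upper bound on the share of rows a project_id filter can skip."""
    total = sum(rows_by_project.values())
    kept = sum(count for project, count in rows_by_project.items()
               if project in target_projects)
    return 1 - kept / total

# Benchmark scenario: every experiment trace lives in the largest project.
benchmark_case = skippable_fraction(ROWS_BY_PROJECT, {"019aef68-..."})

# Hypothetical scenario: traces live only in the small "others" bucket.
small_project_case = skippable_fraction(ROWS_BY_PROJECT, {"others"})

print(f"benchmark: {benchmark_case:.2%}, small project: {small_project_case:.2%}")
```

This is only a bound on rows, not a prediction of query time: how much is actually skipped depends on how the rows are laid out in parts and granules.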


Change checklist

  • User facing
  • Documentation update

Issues

  • OPIK-3846

Testing

  • All 51 DatasetItemVersionDAO tests pass
  • Benchmarked on remote ClickHouse server with 1M records
  • Benchmarked on local ClickHouse with ~2K records
  • Verified both queries return identical results
  • EXPLAIN indexes and EXPLAIN PIPELINE analyzed

Documentation

  • Query optimization follows ClickHouse index best practices
  • Aligns with existing STATS query optimization pattern

Replace JOINs with IN subqueries in SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_STATS
for better ClickHouse query performance:

- Add early trace_data CTE to resolve trace IDs and project_ids upfront
- Replace INNER JOIN experiments_resolved with IN subquery for experiment_items_scope
- Filter feedback_scores and spans by project_id IN (SELECT project_id FROM trace_data)
- Use trace_id IN subquery instead of full workspace scan on traces table

Performance improvements (benchmark with cache disabled):
- 48% faster average execution (219ms vs 420ms)
- 96% more consistent (21ms vs 479ms std dev)
- 83% better worst-case (271ms vs 1567ms max)
- 50% fewer table scans on traces and spans tables
- 38% fewer ReadFromMergeTree operations
- 42% fewer JoiningTransform operations

Also fixes test by setting correct projectName on feedback scores.
… column

- Add experiments_resolved CTE for early experiment resolution
- Add target_projects CTE for project_id index filtering
- Add experiment_items_trace_scope CTE for trace_id filtering
- Filter feedback_scores and spans by project_id and entity_id/trace_id
- Add duration materialized column to traces and spans tables
- Fix feedback_scores_percentiles WHERE clause for correct filtering

Performance improvement on 1M items:
- 3.72x faster (2.91s vs 10.92s)
- 56% fewer rows read (30.84M vs 69.49M)
- 57% less data read (4.58 GiB vs 10.74 GiB)
…T_ITEMS_COUNT query

Add project_id filtering to feedback_scores tables to leverage ClickHouse index structure.
@thiagohora thiagohora requested a review from a team as a code owner February 3, 2026 20:39
@github-actions github-actions bot added the java, Backend, and tests labels Feb 3, 2026
@thiagohora thiagohora changed the base branch from main to thiaghora/OPIK-3846-optimize-experiment-items-stats-query February 3, 2026 20:42

github-actions bot commented Feb 3, 2026

Backend Tests Results

6 635 tests   6 622 ✅  1h 6m 12s ⏱️
  424 suites     13 💤
  424 files        0 ❌

Results for commit ca9b8f1.

♻️ This comment has been updated with latest results.

Base automatically changed from thiaghora/OPIK-3846-optimize-experiment-items-stats-query to main February 4, 2026 10:43
@github-actions github-actions bot removed the tests label Feb 4, 2026
@ldaugusto

Sorry, apparently I'm reviewing your PRs in the reverse order!

@thiagohora thiagohora force-pushed the thiaghora/OPIK-3846-optimize-experiment-items-count-query branch from cfbae18 to bbfa420 Compare February 4, 2026 12:07
@thiagohora thiagohora requested a review from ldaugusto February 4, 2026 12:08
@thiagohora thiagohora force-pushed the thiaghora/OPIK-3846-optimize-experiment-items-count-query branch from 24338f1 to 57165f5 Compare February 4, 2026 14:59
@thiagohora thiagohora force-pushed the thiaghora/OPIK-3846-optimize-experiment-items-count-query branch from 57165f5 to a36b9bb Compare February 4, 2026 15:02
@comet-ml comet-ml deleted a comment from github-actions bot Feb 4, 2026
@comet-ml comet-ml deleted a comment from github-actions bot Feb 4, 2026

github-actions bot commented Feb 4, 2026

SDK E2E Tests Results

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit c4d3b2d.

♻️ This comment has been updated with latest results.

ldaugusto previously approved these changes Feb 4, 2026

@ldaugusto ldaugusto left a comment

The upgrades are very solid and will be complete with #5052 and #5054.

Let's check which fields are in the experiment items filter in an upcoming PR, for much faster filter resolution.

@thiagohora thiagohora force-pushed the thiaghora/OPIK-3846-optimize-experiment-items-count-query branch from c4d3b2d to 14c9aee Compare February 4, 2026 17:08
@thiagohora thiagohora requested a review from ldaugusto February 4, 2026 17:10
@thiagohora thiagohora merged commit 1b847e7 into main Feb 4, 2026
17 of 18 checks passed
@thiagohora thiagohora deleted the thiaghora/OPIK-3846-optimize-experiment-items-count-query branch February 4, 2026 18:00