[OPIK-3846] [BE] Optimize SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT query #5051

thiagohora · 2026-02-03T20:39:10Z

Details

This PR optimizes the SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT query in DatasetItemVersionDAO.java by adding project_id filtering to feedback_scores tables to leverage ClickHouse's index structure (workspace_id, project_id, ...).

Key Optimizations

Pre-computed experiment_items_trace_scope CTE - Pre-computes DISTINCT trace_ids for efficient filtering
Pre-computed target_projects CTE - Pre-computes DISTINCT project_ids from traces
Filtered Feedback Scores Queries - Added project_id IN (SELECT project_id FROM target_projects) filtering

Before vs After

Before:

FROM feedback_scores FINAL
WHERE entity_type = 'trace'
  AND workspace_id = :workspace_id
  AND entity_id IN (SELECT trace_id FROM experiment_items_scope)

After:

FROM feedback_scores FINAL
WHERE entity_type = 'trace'
  AND workspace_id = :workspace_id
  AND project_id IN (SELECT project_id FROM target_projects)
  AND entity_id IN (SELECT trace_id FROM experiment_items_trace_scope)
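
The extra predicate only preserves results if every feedback score carries the same project_id as the trace it scores. A minimal sketch of that equivalence follows, using an in-memory SQLite database as a stand-in for ClickHouse (so no FINAL and no index behavior). The table layouts and both CTE definitions are assumptions for illustration; the PR names the CTEs but does not show their bodies.

```python
import sqlite3

# Toy schema: only columns named in the PR description; everything else is assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE traces (workspace_id TEXT, project_id TEXT, id TEXT);
CREATE TABLE experiment_items (workspace_id TEXT, trace_id TEXT);
CREATE TABLE feedback_scores (
    workspace_id TEXT, project_id TEXT, entity_type TEXT, entity_id TEXT, value REAL
);
INSERT INTO traces VALUES ('w1', 'p1', 't1'), ('w1', 'p1', 't2'), ('w1', 'p2', 't3');
INSERT INTO experiment_items VALUES ('w1', 't1'), ('w1', 't2');
INSERT INTO feedback_scores VALUES
    ('w1', 'p1', 'trace', 't1', 0.9),
    ('w1', 'p1', 'trace', 't2', 0.7),
    ('w1', 'p2', 'trace', 't3', 0.1);
""")

# Assumed CTE definitions (hypothetical; the PR only names these CTEs).
CTES = """
WITH experiment_items_trace_scope AS (
    SELECT DISTINCT trace_id FROM experiment_items
    WHERE workspace_id = :workspace_id
),
target_projects AS (
    SELECT DISTINCT project_id FROM traces
    WHERE workspace_id = :workspace_id
      AND id IN (SELECT trace_id FROM experiment_items_trace_scope)
)
"""

BASE = """
SELECT entity_id, value FROM feedback_scores
WHERE entity_type = 'trace'
  AND workspace_id = :workspace_id
  {project_filter}
  AND entity_id IN (SELECT trace_id FROM experiment_items_trace_scope)
ORDER BY entity_id
"""

before_sql = CTES + BASE.format(project_filter="")
after_sql = CTES + BASE.format(
    project_filter="AND project_id IN (SELECT project_id FROM target_projects)"
)

params = {"workspace_id": "w1"}
before_rows = conn.execute(before_sql, params).fetchall()
after_rows = conn.execute(after_sql, params).fetchall()
print(before_rows == after_rows, after_rows)
```

If a score were stored under a different project than its trace, the "after" query would silently drop it, which is why the project assignment on feedback scores matters for this pattern.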

When This Optimization Helps

The optimization is defensive: query results are unchanged, and it is expected to reduce the feedback_scores rows scanned when:

Workspaces have many projects and experiment traces belong to a small subset of them
The target project contains a small fraction of total feedback_scores
Data volumes are large enough for the project_id filter to prune granules through the (workspace_id, project_id, ...) index prefix
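
A minimal sketch of why the filter can help, assuming feedback_scores is sorted by a key that begins with (workspace_id, project_id, ...) as described above. The model is deliberately simplified: each granule is summarized by its first and last project_id, and a granule is read only if a wanted project could fall inside that range. The project names and row counts are hypothetical.

```python
GRANULE_ROWS = 8192  # ClickHouse's default index_granularity


def granules_to_read(sorted_project_ids, wanted_projects, granule_rows=GRANULE_ROWS):
    """Return (selected, total) granules for rows already sorted by project_id."""
    selected = total = 0
    for start in range(0, len(sorted_project_ids), granule_rows):
        chunk = sorted_project_ids[start:start + granule_rows]
        first, last = chunk[0], chunk[-1]
        total += 1
        # A granule must be read if any wanted project lies within its key range.
        if any(first <= project <= last for project in wanted_projects):
            selected += 1
    return selected, total


# One workspace, two hypothetical projects: 10 granules of "p_big", 1 of "p_small".
keys = ["p_big"] * (10 * GRANULE_ROWS) + ["p_small"] * GRANULE_ROWS

small = granules_to_read(keys, {"p_small"})
big = granules_to_read(keys, {"p_big"})
print(small, big)
```

In this toy layout, filtering on the small project reads 1 of 11 granules, while filtering on the large project still reads 10 of 11, which mirrors the situation in the remote benchmark.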

Benchmark Results

Remote Server (1M records)

Metric	Optimized	Old	Notes
Duration	~1,823 ms	~1,840 ms	Similar (see analysis below)
Rows Read	22,451,192	22,451,192	Same
Data Read	4.00 GiB	4.00 GiB	Same

Note: The benchmark shows minimal improvement because the test dataset's experiment traces belong to the largest project (67.78% of all feedback_scores data). The optimization will show significant gains when traces belong to smaller projects.

Local Server (~2K records)

Metric	Optimized	Old	Difference
Duration	~15 ms	~14 ms	Similar
Rows Read	2,165	2,165	Same
Data Read	393 KiB	393 KiB	Same

EXPLAIN Index Analysis

Remote Server Index Usage (Both Queries - Identical)

Table	Keys	Parts	Granules
`experiment_items`	workspace_id	6/8	831/1,041
`experiments`	workspace_id, dataset_id, id	1/1	1/1
`dataset_item_versions`	workspace_id, dataset_id	1/6	128/11,348
`traces`	workspace_id, id	4/7	149/4,721

EXPLAIN indexes=1 Output (Remote)

ReadFromMergeTree (opik.experiment_items)
Indexes:
  PrimaryKey
    Keys: workspace_id
    Condition: (workspace_id in ['7596558c-...'])
    Parts: 6/8
    Granules: 831/1041

ReadFromMergeTree (opik.experiments)
Indexes:
  PrimaryKey
    Keys: workspace_id, dataset_id, id
    Condition: and((id in 1-element set), and((dataset_id in [...]), (workspace_id in [...])))
    Parts: 1/1
    Granules: 1/1

ReadFromMergeTree (opik.dataset_item_versions)
Indexes:
  PrimaryKey
    Keys: workspace_id, dataset_id
    Condition: and((dataset_id in [...]), (workspace_id in [...]))
    Parts: 1/6
    Granules: 128/11348

ReadFromMergeTree (opik.traces)
Indexes:
  PrimaryKey
    Keys: workspace_id, id
    Condition: and((id in 1000000-element set), (workspace_id in [...]))
    Parts: 4/7
    Granules: 149/4721
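
The Granules lines above are easier to compare as fractions. The following sketch parses EXPLAIN indexes=1 text with a small regex-based helper; granule_selectivity is a hypothetical function written for illustration, not part of ClickHouse or Opik, and the input is an abbreviated copy of the output above.

```python
import re

EXPLAIN_TEXT = """\
ReadFromMergeTree (opik.experiment_items)
    Granules: 831/1041
ReadFromMergeTree (opik.dataset_item_versions)
    Granules: 128/11348
ReadFromMergeTree (opik.traces)
    Granules: 149/4721
"""


def granule_selectivity(explain_text):
    """Map each table to the fraction of granules selected by its primary key."""
    fractions = {}
    table = None
    for line in explain_text.splitlines():
        header = re.search(r"ReadFromMergeTree \(([\w.]+)\)", line)
        if header:
            table = header.group(1)
            continue
        granules = re.search(r"Granules:\s*(\d+)/(\d+)", line)
        if granules and table:
            selected, total = int(granules.group(1)), int(granules.group(2))
            fractions[table] = selected / total
    return fractions


fractions = granule_selectivity(EXPLAIN_TEXT)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%} of granules read")
```

Read this way, experiment_items is filtered only by workspace_id and still reads roughly four fifths of its granules, while dataset_item_versions and traces are pruned to a few percent.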

Local Server Index Usage (Both Queries - Identical)

Table	Keys	Parts	Granules
`experiment_items`	workspace_id	1/1	1/1
`experiments`	workspace_id, dataset_id, id	1/2	1/2
`dataset_item_versions`	workspace_id, dataset_id	1/2	1/2
`traces`	workspace_id, id	1/2	1/2

EXPLAIN PIPELINE Analysis

Both queries produce identical execution pipelines with 32 parallel workers (remote) / 12 parallel workers (local):

(Expression)
ExpressionTransform × 32
  (Aggregating)
  Resize 32 → 32
    AggregatingTransform × 32
      (Expression)
      ExpressionTransform × 32
        (Join)
        JoiningTransform × 32 2 → 1
          (Expression)
          ExpressionTransform × 32
            (Join)
            JoiningTransform × 32 2 → 1
              (Expression)
              ExpressionTransform
                (Filter)
                FilterTransform
                  (LimitBy)
                  LimitByTransform
                    (Sorting)
                    MergingSortedTransform 32 → 1
                      MergeSortingTransform × 32
                        (ReadFromMergeTree)
                        MergeTreeSelect(pool: ReadPool, algorithm: Thread) × 32

Key pipeline stages:

MergeTreeSelect: Parallel reads from ClickHouse tables
JoiningTransform: Hash joins between CTEs
LimitByTransform: Deduplication via LIMIT 1 BY
AggregatingTransform: COUNT(DISTINCT) aggregation
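
As a rough illustration of the last two stages, the sketch below emulates sorting followed by LIMIT 1 BY (keep the first row per key) and then a COUNT(DISTINCT). The row fields id, last_updated_at and trace_id are assumptions chosen for the example; the actual sort and grouping columns of the query are not shown in this PR.

```python
def limit_1_by(rows, key, order_by):
    """Emulate `ORDER BY ... LIMIT 1 BY key`: keep the first row per key in sort order."""
    seen = set()
    kept = []
    for row in sorted(rows, key=order_by):
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept


rows = [
    {"id": "a", "last_updated_at": 1, "trace_id": "t1"},
    {"id": "a", "last_updated_at": 3, "trace_id": "t1"},
    {"id": "b", "last_updated_at": 2, "trace_id": "t2"},
    {"id": "c", "last_updated_at": 5, "trace_id": "t2"},
]

# Newest version of each id first, then one row per id.
latest = limit_1_by(
    rows,
    key=lambda r: r["id"],
    order_by=lambda r: (r["id"], -r["last_updated_at"]),
)

# COUNT(DISTINCT trace_id) over the deduplicated rows.
distinct_traces = len({r["trace_id"] for r in latest})
print([r["last_updated_at"] for r in latest], distinct_traces)
```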

Data Distribution Analysis

Feedback Scores per Project (Remote)

project_id	row_count	percentage
019aef68-... (experiment's project)	20,252,668	67.78%
019b9c7d-...	9,609,498	32.16%
Others	~19,804	0.06%

Key insight: All 1,000,000 experiment traces belong to the largest project (67.78% of feedback_scores rows), so in this dataset project_id filtering can skip at most about 32% of rows.
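
The upper bound on skippable rows follows directly from the table above. A small worked calculation, using the row counts as reported (the "Others" figure is approximate in the source, so the result is approximate too):

```python
# Row counts from the "Feedback Scores per Project (Remote)" table.
experiment_project_rows = 20_252_668
second_project_rows = 9_609_498
other_rows = 19_804  # reported as "~19,804"

total_rows = experiment_project_rows + second_project_rows + other_rows

# Share of rows that the project_id filter must still read.
must_read_pct = experiment_project_rows / total_rows * 100
# Best case for the filter: every row outside the experiment's project is skipped.
max_skippable_pct = 100 - must_read_pct

print(f"must read: {must_read_pct:.2f}%, skippable at most: {max_skippable_pct:.2f}%")
```

This is only a ceiling on the benefit. The remote benchmark above reports identical rows read for both queries, so the realized saving in that run was smaller than the bound.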

Change checklist

User facing
Documentation update

Issues

OPIK-3846

Testing

All 51 DatasetItemVersionDAO tests pass
Benchmarked on remote ClickHouse server with 1M records
Benchmarked on local ClickHouse with ~2K records
Verified both queries return identical results
EXPLAIN indexes and EXPLAIN PIPELINE analyzed

Documentation

Query optimization follows ClickHouse index best practices
Aligns with existing STATS query optimization pattern

Replace JOINs with IN subqueries in SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_STATS for better ClickHouse query performance:
- Add early trace_data CTE to resolve trace IDs and project_ids upfront
- Replace INNER JOIN experiments_resolved with IN subquery for experiment_items_scope
- Filter feedback_scores and spans by project_id IN (SELECT project_id FROM trace_data)
- Use trace_id IN subquery instead of full workspace scan on traces table

Performance improvements (benchmark with cache disabled):
- 48% faster average execution (219ms vs 420ms)
- 96% more consistent (21ms vs 479ms std dev)
- 83% better worst-case (271ms vs 1567ms max)
- 50% fewer table scans on traces and spans tables
- 38% fewer ReadFromMergeTree operations
- 42% fewer JoiningTransform operations

Also fixes test by setting correct projectName on feedback scores.
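
The timing percentages in this commit message are relative reductions against the old query. A short check, assuming that definition and using only the millisecond figures quoted above:

```python
def percent_reduction(new, old):
    """Relative reduction of `new` versus `old`, in percent."""
    return (old - new) / old * 100


average = percent_reduction(219, 420)      # average execution time, ms
std_dev = percent_reduction(21, 479)       # standard deviation, ms
worst_case = percent_reduction(271, 1567)  # maximum execution time, ms

print(round(average), round(std_dev), round(worst_case))  # 48 96 83
```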

…s-stats-query

… column
- Add experiments_resolved CTE for early experiment resolution
- Add target_projects CTE for project_id index filtering
- Add experiment_items_trace_scope CTE for trace_id filtering
- Filter feedback_scores and spans by project_id and entity_id/trace_id
- Add duration materialized column to traces and spans tables
- Fix feedback_scores_percentiles WHERE clause for correct filtering

Performance improvement on 1M items:
- 3.72x faster (2.91s vs 10.92s)
- 56% fewer rows read (30.84M vs 69.49M)
- 57% less data read (4.58 GiB vs 10.74 GiB)

…T_ITEMS_COUNT query Add project_id filtering to feedback_scores tables to leverage ClickHouse index structure.

github-actions · 2026-02-03T21:18:06Z

Backend Tests Results

6 635 tests: 6 622 ✅ passed, 13 💤 skipped, 0 ❌ failed
424 suites, 424 files
Duration: 1h 6m 12s ⏱️

Results for commit ca9b8f1.

♻️ This comment has been updated with latest results.

…s-count-query

apps/opik-backend/src/main/java/com/comet/opik/domain/DatasetItemVersionDAO.java

ldaugusto · 2026-02-04T11:53:34Z

Sorry, apparently I'm reviewing your PRs in the reverse order!

apps/opik-backend/src/main/java/com/comet/opik/domain/DatasetItemVersionDAO.java

github-actions · 2026-02-04T15:10:50Z

SDK E2E Tests Results

0 tests: 0 ✅ passed, 0 💤 skipped, 0 ❌ failed
0 suites, 0 files
Duration: 0s ⏱️

Results for commit c4d3b2d.

♻️ This comment has been updated with latest results.

apps/opik-backend/src/main/java/com/comet/opik/domain/filter/FilterQueryBuilder.java

apps/opik-backend/src/main/java/com/comet/opik/domain/DatasetItemVersionDAO.java

ldaugusto

The upgrades are very solid, and they will be complete together with #5052 and #5054.

In an upcoming PR, let's check which fields are used in the experiment items filter, for much faster filter resolution.

…s-count-query

thiagohora added 5 commits February 2, 2026 20:44

Merge branch 'main' into thiaghora/OPIK-3846-optimize-experiment-item…

ac7a608

…s-stats-query

Fix issue

6f3d753

[OPIK-3846] [BE] Optimize SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMEN…

c077e94

…T_ITEMS_COUNT query Add project_id filtering to feedback_scores tables to leverage ClickHouse index structure.

thiagohora requested a review from a team as a code owner February 3, 2026 20:39

github-actions bot assigned thiagohora Feb 3, 2026

github-actions bot added the java, Backend, and tests labels Feb 3, 2026

thiagohora changed the base branch from main to thiaghora/OPIK-3846-optimize-experiment-items-stats-query February 3, 2026 20:42

baz-reviewer bot approved these changes Feb 3, 2026

Base automatically changed from thiaghora/OPIK-3846-optimize-experiment-items-stats-query to main February 4, 2026 10:43

Merge branch 'main' into thiaghora/OPIK-3846-optimize-experiment-item…

a4af729

…s-count-query

github-actions bot removed the tests label Feb 4, 2026

thiagohora added 2 commits February 4, 2026 11:46

Update 000055_add_duration_column_to_traces_and_spans.sql

046ad92

Update 000055_add_duration_column_to_traces_and_spans.sql

db29439

ldaugusto reviewed Feb 4, 2026

apps/opik-backend/src/main/java/com/comet/opik/domain/DatasetItemVersionDAO.java (outdated, resolved)

Address PR review feedback

bbfa420

thiagohora force-pushed the thiaghora/OPIK-3846-optimize-experiment-items-count-query branch from cfbae18 to bbfa420 Compare February 4, 2026 12:07

thiagohora requested a review from ldaugusto February 4, 2026 12:08

baz-reviewer bot reviewed Feb 4, 2026

apps/opik-backend/src/main/java/com/comet/opik/domain/DatasetItemVersionDAO.java (outdated, resolved)

thiagohora force-pushed the thiaghora/OPIK-3846-optimize-experiment-items-count-query branch from 24338f1 to 57165f5 Compare February 4, 2026 14:59

Fix filters

a36b9bb

thiagohora force-pushed the thiaghora/OPIK-3846-optimize-experiment-items-count-query branch from 57165f5 to a36b9bb Compare February 4, 2026 15:02

comet-ml deleted a comment from github-actions bot Feb 4, 2026

baz-reviewer bot reviewed Feb 4, 2026

apps/opik-backend/src/main/java/com/comet/opik/domain/filter/FilterQueryBuilder.java (resolved)

apps/opik-backend/src/main/java/com/comet/opik/domain/DatasetItemVersionDAO.java (resolved)

ldaugusto previously approved these changes Feb 4, 2026

baz-reviewer bot approved these changes Feb 4, 2026

Merge branch 'main' into thiaghora/OPIK-3846-optimize-experiment-item…

14c9aee

…s-count-query

thiagohora dismissed ldaugusto’s stale review via 14c9aee February 4, 2026 17:08

thiagohora force-pushed the thiaghora/OPIK-3846-optimize-experiment-items-count-query branch from c4d3b2d to 14c9aee Compare February 4, 2026 17:08

Merge branch 'main' into thiaghora/OPIK-3846-optimize-experiment-item…

ca9b8f1

…s-count-query

thiagohora requested a review from ldaugusto February 4, 2026 17:10

ldaugusto approved these changes Feb 4, 2026

thiagohora merged commit 1b847e7 into main Feb 4, 2026
17 of 18 checks passed

thiagohora deleted the thiaghora/OPIK-3846-optimize-experiment-items-count-query branch February 4, 2026 18:00
