[OPIK-3846] [BE] Optimize SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS query with CTEs #5052
Merged
thiagohora
merged 14 commits into
main
from
thiaghora/OPIK-3846-optimize-experiment-items-count-query-v2
Feb 4, 2026
Replace JOINs with IN subqueries in SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_STATS for better ClickHouse query performance:
- Add early trace_data CTE to resolve trace IDs and project_ids upfront
- Replace INNER JOIN experiments_resolved with IN subquery for experiment_items_scope
- Filter feedback_scores and spans by project_id IN (SELECT project_id FROM trace_data)
- Use trace_id IN subquery instead of full workspace scan on traces table

Performance improvements (benchmark with cache disabled):
- 48% faster average execution (219ms vs 420ms)
- 96% more consistent (21ms vs 479ms std dev)
- 83% better worst-case (271ms vs 1567ms max)
- 50% fewer table scans on traces and spans tables
- 38% fewer ReadFromMergeTree operations
- 42% fewer JoiningTransform operations

Also fixes a test by setting the correct projectName on feedback scores.
… column
- Add experiments_resolved CTE for early experiment resolution
- Add target_projects CTE for project_id index filtering
- Add experiment_items_trace_scope CTE for trace_id filtering
- Filter feedback_scores and spans by project_id and entity_id/trace_id
- Add duration materialized column to traces and spans tables
- Fix feedback_scores_percentiles WHERE clause for correct filtering

Performance improvement on 1M items:
- 3.72x faster (2.91s vs 10.92s)
- 56% fewer rows read (30.84M vs 69.49M)
- 57% less data read (4.58 GiB vs 10.74 GiB)
…T_ITEMS_COUNT query

Add project_id filtering to feedback_scores tables to leverage ClickHouse index structure.
…T_ITEMS query with CTEs

Add CTEs for trace data pre-computation to reduce I/O on large datasets:
- experiment_items_trace_scope: Pre-compute distinct trace_ids
- target_projects: Pre-compute project_ids for index usage
- trace_data: Pre-compute trace data with deduplication
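
The CTE layering named in the commit messages above can be sketched roughly as follows. This is an illustrative sketch only, not the query from DatasetItemVersionDAO.java: the class name, the selected columns, the bind parameters (:workspace_id, :experiment_ids), the last_updated_at column, and the deduplication clause are all assumptions made for the example. Only the three CTE names, their stated purposes, and the reuse of trace_data in the final JOIN come from the PR.

```java
// Hypothetical sketch: NOT the actual query from DatasetItemVersionDAO.java.
// Column names, bind parameters, and the dedup strategy are assumed.
class ExperimentItemsQuerySketch {

    static String buildQuery() {
        return String.join("\n",
            // 1. Narrow the work to the trace_ids referenced by the experiment items.
            "WITH experiment_items_trace_scope AS (",
            "    SELECT DISTINCT trace_id",
            "    FROM experiment_items",
            "    WHERE workspace_id = :workspace_id",
            "      AND experiment_id IN :experiment_ids",
            "),",
            // 2. Resolve the project_ids of those traces so later filters can
            //    include project_id (potential index usage).
            "target_projects AS (",
            "    SELECT DISTINCT project_id",
            "    FROM traces",
            "    WHERE workspace_id = :workspace_id",
            "      AND id IN (SELECT trace_id FROM experiment_items_trace_scope)",
            "),",
            // 3. Read trace rows once, deduplicated (assumed: latest row per id).
            "trace_data AS (",
            "    SELECT id, project_id, duration",
            "    FROM traces",
            "    WHERE workspace_id = :workspace_id",
            "      AND project_id IN (SELECT project_id FROM target_projects)",
            "      AND id IN (SELECT trace_id FROM experiment_items_trace_scope)",
            "    ORDER BY id DESC, last_updated_at DESC",
            "    LIMIT 1 BY id",
            ")",
            // 4. Reuse trace_data in the final JOIN instead of rescanning traces.
            "SELECT ei.id, ei.dataset_item_id, td.id AS trace_id, td.duration",
            "FROM experiment_items AS ei",
            "LEFT JOIN trace_data AS td ON td.id = ei.trace_id",
            "WHERE ei.workspace_id = :workspace_id",
            "  AND ei.experiment_id IN :experiment_ids");
    }

    public static void main(String[] args) {
        System.out.println(buildQuery());
    }
}
```

The point of the ordering is that each CTE filters on the output of the previous one, so the traces table is only read for trace_ids and project_ids that the experiment items actually reference.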
...ces/liquibase/db-app-analytics/migrations/000055_add_duration_column_to_traces_and_spans.sql
apps/opik-backend/src/main/java/com/comet/opik/domain/DatasetItemVersionDAO.java
Contributor
ldaugusto
previously approved these changes
Feb 4, 2026
Contributor
ldaugusto
left a comment
Assuming #5054 fixes the performance, there are only small things to discuss.
57165f5 to
a36b9bb
Compare
…ry' into thiaghora/OPIK-3846-optimize-experiment-items-count-query-v2
c4d3b2d to
14c9aee
Compare
Base automatically changed from
thiaghora/OPIK-3846-optimize-experiment-items-count-query
to
main
February 4, 2026 18:00
ldaugusto
approved these changes
Feb 4, 2026
Details
Optimize the SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS query in DatasetItemVersionDAO.java using CTEs.

Optimization Strategy
The optimized query introduces the following CTEs:
- experiment_items_trace_scope: Pre-compute distinct trace_ids from experiment items
- target_projects: Pre-compute distinct project_ids from traces (for potential index usage)
- trace_data: Pre-compute trace data with deduplication, reused in the final JOIN

Benchmark Results
Preprod Environment (Large Dataset)
Local Environment (Small Dataset)
Analysis
Preprod (Large Dataset):
Local (Small Dataset):
Change checklist
Issues
Testing
- DatasetsResourceTest.java
- benchmark_local.sql, benchmark_search.sql
Documentation