[OPIK-3846] [BE] Optimize SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT query #5051
Replace JOINs with IN subqueries in SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_STATS for better ClickHouse query performance:
- Add early trace_data CTE to resolve trace IDs and project_ids upfront
- Replace INNER JOIN experiments_resolved with IN subquery for experiment_items_scope
- Filter feedback_scores and spans by project_id IN (SELECT project_id FROM trace_data)
- Use trace_id IN subquery instead of full workspace scan on traces table

Performance improvements (benchmark with cache disabled):
- 48% faster average execution (219ms vs 420ms)
- 96% more consistent (21ms vs 479ms std dev)
- 83% better worst-case (271ms vs 1567ms max)
- 50% fewer table scans on traces and spans tables
- 38% fewer ReadFromMergeTree operations
- 42% fewer JoiningTransform operations

Also fixes test by setting correct projectName on feedback scores.
… column
- Add experiments_resolved CTE for early experiment resolution
- Add target_projects CTE for project_id index filtering
- Add experiment_items_trace_scope CTE for trace_id filtering
- Filter feedback_scores and spans by project_id and entity_id/trace_id
- Add duration materialized column to traces and spans tables
- Fix feedback_scores_percentiles WHERE clause for correct filtering

Performance improvement on 1M items:
- 3.72x faster (2.91s vs 10.92s)
- 56% fewer rows read (30.84M vs 69.49M)
- 57% less data read (4.58 GiB vs 10.74 GiB)
…T_ITEMS_COUNT query
Add project_id filtering to feedback_scores tables to leverage ClickHouse index structure.
Backend Tests Results: 6 635 tests, 6 622 ✅, 1h 6m 12s ⏱️. Results for commit ca9b8f1. ♻️ This comment has been updated with latest results.
apps/opik-backend/src/main/java/com/comet/opik/domain/DatasetItemVersionDAO.java
Sorry, apparently I'm reviewing your PRs in the reverse order!
cfbae18 to bbfa420
24338f1 to 57165f5
57165f5 to a36b9bb
SDK E2E Tests Results: 0 tests, 0 ✅, 0s ⏱️. Results for commit c4d3b2d. ♻️ This comment has been updated with latest results.
ldaugusto left a comment
c4d3b2d to 14c9aee
Details
This PR optimizes the `SELECT_DATASET_ITEM_VERSIONS_WITH_EXPERIMENT_ITEMS_COUNT` query in `DatasetItemVersionDAO.java` by adding `project_id` filtering to feedback_scores tables to leverage ClickHouse's index structure `(workspace_id, project_id, ...)`.

Key Optimizations
- `experiment_items_trace_scope` CTE - Pre-computes DISTINCT trace_ids for efficient filtering
- `target_projects` CTE - Pre-computes DISTINCT project_ids from traces
- `project_id IN (SELECT project_id FROM target_projects)` filtering

Before vs After
Before:
After:
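As a rough illustration of the pattern listed under Key Optimizations (not the PR's actual query), the sketch below assembles a ClickHouse-style statement in Java. The CTE names `experiment_items_trace_scope` and `target_projects` come from the description; the class name, the selected columns, the `experiment_id` column, and the `:workspace_id` / `:experiment_ids` bind parameters are assumptions made for this example only.

```java
import java.util.List;

// Hypothetical sketch: column names, bind parameters, and the class itself are
// assumptions for illustration, not the query in DatasetItemVersionDAO.java.
public final class ProjectScopedQuerySketch {

    private ProjectScopedQuerySketch() {
    }

    // Returns a query that narrows feedback_scores by project_id first,
    // then by the trace ids that belong to the selected experiments.
    public static String feedbackScoresQuery() {
        return String.join("\n", List.of(
                "WITH experiment_items_trace_scope AS (",
                "    SELECT DISTINCT trace_id",
                "    FROM experiment_items",
                "    WHERE workspace_id = :workspace_id",
                "      AND experiment_id IN :experiment_ids",
                "),",
                "target_projects AS (",
                "    SELECT DISTINCT project_id",
                "    FROM traces",
                "    WHERE workspace_id = :workspace_id",
                "      AND id IN (SELECT trace_id FROM experiment_items_trace_scope)",
                ")",
                "SELECT entity_id, name, value",
                "FROM feedback_scores",
                "WHERE workspace_id = :workspace_id",
                "  AND project_id IN (SELECT project_id FROM target_projects)",
                "  AND entity_id IN (SELECT trace_id FROM experiment_items_trace_scope)"));
    }

    public static void main(String[] args) {
        // Print the assembled statement for inspection.
        System.out.println(feedbackScoresQuery());
    }
}
```

The intent is that the `project_id` predicate follows `workspace_id` in the index prefix described above, so ClickHouse can, in principle, skip data belonging to other projects before the per-trace filter is applied. How much it skips depends on how the data is distributed across projects.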
When This Optimization Helps
The optimization is defensive and will show significant improvement when:
- `project_id` filtering leverages ClickHouse's index structure

Benchmark Results
Remote Server (1M records)
Note: The benchmark shows minimal improvement because the test dataset's experiment traces belong to the largest project (67.78% of all feedback_scores data). The optimization will show significant gains when traces belong to smaller projects.
Local Server (~2K records)
EXPLAIN Index Analysis
Remote Server Index Usage (Both Queries - Identical)
Tables: `experiment_items`, `experiments`, `dataset_item_versions`, `traces`

EXPLAIN indexes=1 Output (Remote)
Local Server Index Usage (Both Queries - Identical)
Tables: `experiment_items`, `experiments`, `dataset_item_versions`, `traces`

EXPLAIN PIPELINE Analysis
Both queries produce identical execution pipelines with 32 parallel workers (remote) / 12 parallel workers (local):
Key pipeline stages:
Data Distribution Analysis
Feedback Scores per Project (Remote)
Key insight: All 1,000,000 experiment traces belong to the largest project (67.78% of data), so `project_id` filtering can only skip 32% of rows.

Change checklist
Issues
Testing
- `DatasetItemVersionDAO` tests pass

Documentation