Speedup sqllogictests by running long running tests first #20576
alamb merged 3 commits into apache:main
| # Test push down filter | ||
|
|
||
| # Regression test for https://github.com/apache/datafusion/issues/17188 | ||
| query I |
I experimented with reducing the file generation cost here. Instead of creating two separate parquet files, we can create only t2.parquet and register t1 as a projection of it with just the k column:
CREATE EXTERNAL TABLE t1 (k INT)
STORED AS PARQUET LOCATION '.../t2.parquet';

This gave a noticeable speedup locally:
| | Baseline (2 files) | Optimized (1 file) |
|---|---|---|
| Min | 33.000s | 22.653s |
| Max | 37.662s | 25.489s |
| Avg | 34.427s | 24.092s |
One open question: does the correctness of this regression test rely on having two physically separate files? The race condition in #17197 was in the execution layer — both scans would still be independent DataSourceExec nodes with independent readers, so I believe the behavior is preserved. But if there's any concern, we could use system cp to copy the file and register two physical files while still only paying the generate_series cost once.
> both scans would still be independent DataSourceExec nodes with independent readers, so I believe the behavior is preserved. But if there's any concern, we could use system cp to copy the file and register two physical files while still only paying the generate_series cost once.
Thanks @Tim-53 -- that is a great idea.
| /// 9. 0.973s datetime/timestamps.slt | ||
| /// 10. 0.822s cte.slt | ||
| /// ``` | ||
| static TEST_PRIORITY: LazyLock<HashMap<PathBuf, usize>> = LazyLock::new(|| { |
Would a simpler static slice be sufficient here (eg &[(&str, usize)] with a small helper) instead of LazyLock<HashMap<PathBuf, usize>>?
I think a slice would be fine too, though then the lookup would potentially be slower (as it would have to resort each) -- I'll leave it as a HashMap unless you feel strongly.
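For comparison, the slice-based alternative raised in this thread could look roughly like the sketch below. It is only an illustration, not the code in this PR: the `priority_of` helper and the priority values are hypothetical, and the file names are borrowed from elsewhere on this page. Each lookup is a linear scan, which is likely negligible for a table of around ten entries.

```rust
use std::path::Path;

/// Hypothetical priority table: (relative path, priority). Lower values run first.
/// The priority numbers here are made up for illustration.
static TEST_PRIORITY: &[(&str, usize)] = &[
    ("push_down_filter_regression.slt", 0),
    ("datetime/timestamps.slt", 1),
    ("cte.slt", 2),
];

/// Priority used for files that are not listed in `TEST_PRIORITY`.
const DEFAULT_PRIORITY: usize = 100;

/// Hypothetical helper: linear scan over the slice, O(n) per lookup.
fn priority_of(path: &Path) -> usize {
    TEST_PRIORITY
        .iter()
        .find(|(name, _)| Path::new(name) == path)
        .map(|(_, priority)| *priority)
        .unwrap_or(DEFAULT_PRIORITY)
}
```

The trade-off is the one described above: the slice needs no `LazyLock` initialization, while the `HashMap` gives constant-time lookups.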
| /// Prioritizes test to run earlier if they are known to be long running (as | ||
| /// each test file itself is run sequentially, but multiple test files are run | ||
| /// in parallel. | ||
| fn sort_tests(mut tests: Vec<TestFile>) -> Vec<TestFile> { |
Can we add a deterministic tie-breaker in sort_tests (for equal priority) using relative_path, e.g. sort_by_key(|f| (priority, f.relative_path.clone())) to keep run order stable?
This would also benefit from a small unit test covering:
- prioritized files are first,
- non-prioritized files keep deterministic ordering
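A minimal sketch of the suggested tie-breaker and unit test, under stated assumptions: the `TestFile` struct here is a stand-in with only a `relative_path` field, and the priority map is passed in as a parameter so the function can be tested in isolation (the real `sort_tests` reads a static map). It uses `sort_by` with `then_with` instead of `sort_by_key` to avoid cloning each path.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Stand-in for the harness's `TestFile`; only the field needed for sorting.
#[derive(Debug, Clone, PartialEq, Eq)]
struct TestFile {
    relative_path: PathBuf,
}

/// Priority for files absent from the map. Lower values run first.
const DEFAULT_PRIORITY: usize = 100;

/// Orders files by (priority, relative_path) so that files with equal
/// priority always come out in the same path-based order.
fn sort_tests(mut tests: Vec<TestFile>, priorities: &HashMap<PathBuf, usize>) -> Vec<TestFile> {
    let priority = |f: &TestFile| {
        priorities
            .get(&f.relative_path)
            .copied()
            .unwrap_or(DEFAULT_PRIORITY)
    };
    tests.sort_by(|a, b| {
        priority(a)
            .cmp(&priority(b))
            .then_with(|| a.relative_path.cmp(&b.relative_path))
    });
    tests
}

#[cfg(test)]
mod tests {
    use super::*;

    fn file(name: &str) -> TestFile {
        TestFile { relative_path: PathBuf::from(name) }
    }

    #[test]
    fn prioritized_first_then_by_path() {
        let mut priorities = HashMap::new();
        priorities.insert(PathBuf::from("push_down_filter_regression.slt"), 0);

        let sorted = sort_tests(
            vec![file("b.slt"), file("push_down_filter_regression.slt"), file("a.slt")],
            &priorities,
        );

        // The prioritized file is first; the rest are ordered by path.
        assert_eq!(
            sorted,
            vec![file("push_down_filter_regression.slt"), file("a.slt"), file("b.slt")]
        );
    }
}
```

Passing the map as an argument is a design choice for testability in this sketch only; a thin wrapper could forward the static map to keep the existing call sites unchanged.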
|
|
||
| /// Default priority for tests not in the TEST_PRIORITY map. Tests with lower | ||
| /// priority values run first. | ||
| static DEFAULT_PRIORITY: usize = 100; |
nit: can this be a const instead of a static?
…er than 2 (#20586)

Draft as it builds on #20576

## Which issue does this PR close?

- Part of #20524
- Follow on to #20576 from @alamb

## Rationale for this change

Execution time of the test is dominated by the time writing the parquet files. By reusing the file we can gain around 30% improvement on the execution time here.

## What changes are included in this PR?

Building on #20576 we reuse the needed parquet file for the test instead of recreating it.

## Are these changes tested?

Ran the test with following results:

| | Baseline (2 files) | Optimized (1 file) |
|---|---|---|
| Min | 33.000s | 22.653s |
| Max | 37.662s | 25.489s |
| Avg | 34.427s | 24.092s |

One open question: does the correctness of this regression test rely on having two **physically separate** files? The race condition in #17197 was in the execution layer -- both scans would still be independent `DataSourceExec` nodes with independent readers, so I believe the behavior is preserved. But if there's any concern, we could use `system cp` to copy the file and register two physical files while still only paying the `generate_series` cost once.

## Are there any user-facing changes?
Whoops -- sorry @kosiew -- I will address your comments as a follow on PR
I am really struggling to write tests for this code as it is all in the
Which issue does this PR close?
push_down_filter.slt into standalone sqllogictest files to reduce long-tail runtime #20566 from @kosiew
Rationale for this change
Our sqllogictests harness runs the queries in a single file serially, but runs multiple files in parallel.
Right now, the runtime of `cargo test --profile=ci --test sqllogictests` is dominated by a few long running tests -- so the sooner they are started, the sooner the overall suite finishes.
What changes are included in this PR?
Building on #20566 from @kosiew, this PR adds a heuristic reordering of the tests so that the longest running ones are run first.
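To illustrate why start order matters when files run in parallel, here is a small simulation sketch. It is not the harness code: the worker model (each free worker picks up the next file in list order) and the durations are assumptions chosen for illustration.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Simulates `workers` parallel workers that each take the next file in list
/// order as soon as they are free. Returns the time at which the last file
/// finishes (the makespan).
fn makespan(durations: &[u64], workers: usize) -> u64 {
    assert!(workers > 0, "need at least one worker");
    // Min-heap of the times at which each worker becomes free.
    let mut free_at: BinaryHeap<Reverse<u64>> = (0..workers).map(|_| Reverse(0)).collect();
    let mut finish: u64 = 0;
    for &duration in durations {
        // The earliest-free worker starts the next file.
        let Reverse(start) = free_at.pop().expect("heap holds one entry per worker");
        let end = start + duration;
        finish = finish.max(end);
        free_at.push(Reverse(end));
    }
    finish
}

/// Returns the durations reordered so the longest come first.
fn longest_first(durations: &[u64]) -> Vec<u64> {
    let mut sorted = durations.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a));
    sorted
}
```

With made-up durations `[2, 2, 2, 2, 9]` and two workers, `makespan` returns 13 when the long file is started last and 9 when `longest_first` moves it to the front. In the second case the total equals the duration of the longest file, which is the lower bound for any ordering.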
Are these changes tested?
By CI and I ran performance tests manually
on main
On main this takes 8 seconds
After test split
After #20566 it takes 7 seconds to complete:
This PR
With this PR it takes 5 seconds:
This is actually bounded by the time it takes to run the longest test, `push_down_filter_regression.slt`:
Are there any user-facing changes?