
alamb · 2026-02-26T13:59:08Z

Which issue does this PR close?

part of Speedup execution of sqllogictests with more parallelization #20524
Follow on to Split push_down_filter.slt into standalone sqllogictest files to reduce long-tail runtime #20566 from @kosiew

Rationale for this change

Our sqllogictests harness runs the queries in a single file serially, but runs multiple files in parallel.

Right now, the runtime of

cargo test --profile=ci --test sqllogictests

is dominated by a few long-running tests, so the sooner they are started, the sooner the overall suite finishes.

What changes are included in this PR?

Building on #20566 from @kosiew, this PR adds a heuristic reordering of the test files so that the longest-running ones are started first.
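To illustrate why start order matters when files run in parallel but each file runs serially, here is a minimal simulation sketch. The `makespan` and `longest_first` functions, the worker count, and the per-file durations are hypothetical and are not taken from the runner; the sketch only models a pool in which each free worker picks up the next file in list order.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Simulates `workers` threads that each take the next file from `durations`
/// (in order) as soon as they become free. Returns the total wall time.
/// Panics if `workers` is zero.
fn makespan(durations: &[u64], workers: usize) -> u64 {
    // Min-heap holding the time at which each worker becomes free.
    let mut free_at: BinaryHeap<Reverse<u64>> =
        (0..workers).map(|_| Reverse(0)).collect();
    let mut finish = 0u64;
    for &d in durations {
        // The earliest-free worker starts the next file.
        let Reverse(start) = free_at.pop().expect("at least one worker");
        let end = start + d;
        finish = finish.max(end);
        free_at.push(Reverse(end));
    }
    finish
}

/// Returns a copy of `durations` ordered longest first.
fn longest_first(durations: &[u64]) -> Vec<u64> {
    let mut sorted = durations.to_vec();
    sorted.sort_by_key(|&d| Reverse(d));
    sorted
}
```

With the made-up durations `[1, 1, 1, 1, 1, 1, 5]` (seconds) and 2 workers, `makespan` returns 8 when the long file is started last and 6 when `longest_first` moves it to the front. Once the longest file starts first, the total is bounded below by that file's own runtime.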

Are these changes tested?

By CI, and I ran performance tests manually.

on main

On main this takes 8 seconds

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo test --profile=ci --test sqllogictests
    Finished `ci` profile [unoptimized] target(s) in 0.21s
     Running bin/sqllogictests.rs (target/ci/deps/sqllogictests-c4e4be8d5c9fd66e)
Running with 16 test threads (available parallelism: 16)
Completed 408 test files in 8 seconds

After test split

After #20566 it takes 7 seconds to complete:

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo test --profile=ci --test sqllogictests
    Finished `ci` profile [unoptimized] target(s) in 0.20s
     Running bin/sqllogictests.rs (target/ci/deps/sqllogictests-c4e4be8d5c9fd66e)
Running with 16 test threads (available parallelism: 16)
Completed 411 test files in 7 seconds

This PR

With this PR it takes 5 seconds:

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo test --profile=ci --test sqllogictests
   Compiling datafusion-sqllogictest v52.1.0 (/Users/andrewlamb/Software/datafusion/datafusion/sqllogictest)
    Finished `ci` profile [unoptimized] target(s) in 1.92s
     Running bin/sqllogictests.rs (target/ci/deps/sqllogictests-c4e4be8d5c9fd66e)
Running with 16 test threads (available parallelism: 16)
Completed 411 test files in 5 seconds

This is actually bounded by the time it takes to run the longest test push_down_filter_regression.slt:

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo test --profile=ci --test sqllogictests -- push_down_filter_regression.slt
    Finished `ci` profile [unoptimized] target(s) in 0.20s
     Running bin/sqllogictests.rs (target/ci/deps/sqllogictests-c4e4be8d5c9fd66e)
Running with 16 test threads (available parallelism: 16)
Completed 1 test files in 5 seconds

Are there any user-facing changes?

Tim-53 · 2026-02-26T20:47:26Z

datafusion/sqllogictest/test_files/push_down_filter_regression.slt

+# Test push down filter
+
+# Regression test for https://github.com/apache/datafusion/issues/17188
+query I


I experimented with reducing the file generation cost here. Instead of creating two separate parquet files, we can create only t2.parquet and register t1 as a projection of it with just the k column:

CREATE EXTERNAL TABLE t1 (k INT) STORED AS PARQUET LOCATION '.../t2.parquet';

This gave a noticeable speedup locally:

| | Baseline (2 files) | Optimized (1 file) |
|---|---|---|
| Min | 33.000s | 22.653s |
| Max | 37.662s | 25.489s |
| Avg | 34.427s | 24.092s |

One open question: does the correctness of this regression test rely on having two physically separate files? The race condition in #17197 was in the execution layer — both scans would still be independent DataSourceExec nodes with independent readers, so I believe the behavior is preserved. But if there's any concern, we could use system cp to copy the file and register two physical files while still only paying the generate_series cost once.

Just filed a follow-up PR for that #20586.

> both scans would still be independent DataSourceExec nodes with independent readers, so I believe the behavior is preserved. But if there's any concern, we could use system cp to copy the file and register two physical files while still only paying the generate_series cost once.

Thanks @Tim-53 -- that is a great idea.


kosiew · 2026-03-02T06:31:20Z

datafusion/sqllogictest/bin/sqllogictests.rs

+/// 9.    0.973s  datetime/timestamps.slt
+/// 10.    0.822s  cte.slt
+/// ```
+static TEST_PRIORITY: LazyLock<HashMap<PathBuf, usize>> = LazyLock::new(|| {


Would a simpler static slice be sufficient here (e.g. &[(&str, usize)] with a small helper) instead of LazyLock<HashMap<PathBuf, usize>>?

I think a slice would be fine too, though then the lookup would potentially be slower (as it would have to resort each) -- I'll leave it as a HashMap unless you feel strongly.
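For comparison, the two options discussed in this thread could look roughly like the sketch below. The table entries and their priority values are illustrative placeholders, and the names `PRIORITY_SLICE`, `priority_from_slice`, `PRIORITY_MAP`, and `priority_from_map` are hypothetical; this is not the code in `sqllogictests.rs`.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::LazyLock;

/// Fallback for files that are not listed (lower values run first).
const DEFAULT_PRIORITY: usize = 100;

/// Illustrative entries only; the priority values are made up.
const PRIORITY_SLICE: &[(&str, usize)] = &[
    ("push_down_filter_regression.slt", 0),
    ("datetime/timestamps.slt", 8),
    ("cte.slt", 9),
];

/// Slice variant: no lazy initialization, but a linear scan per lookup.
fn priority_from_slice(path: &Path) -> usize {
    PRIORITY_SLICE
        .iter()
        .find(|(name, _)| Path::new(name) == path)
        .map(|&(_, p)| p)
        .unwrap_or(DEFAULT_PRIORITY)
}

/// Map variant: built once on first access, then hashed lookups.
static PRIORITY_MAP: LazyLock<HashMap<PathBuf, usize>> = LazyLock::new(|| {
    PRIORITY_SLICE
        .iter()
        .map(|&(name, p)| (PathBuf::from(name), p))
        .collect()
});

fn priority_from_map(path: &Path) -> usize {
    PRIORITY_MAP.get(path).copied().unwrap_or(DEFAULT_PRIORITY)
}
```

Both helpers return the same value for any path. With only a handful of entries and a few hundred test files the lookup cost is unlikely to matter either way, so the choice is mostly about readability.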

kosiew · 2026-03-02T06:34:19Z

datafusion/sqllogictest/bin/sqllogictests.rs

+/// Prioritizes test to run earlier if they are known to be long running (as
+/// each test file itself is run sequentially, but multiple test files are run
+/// in parallel.
+fn sort_tests(mut tests: Vec<TestFile>) -> Vec<TestFile> {


Can we add a deterministic tie-breaker in sort_tests (for equal priority) using relative_path, e.g. sort_by_key(|f| (priority, f.relative_path.clone())) to keep run order stable?

This would also benefit from a small unit test covering:

prioritized files are first,

non-prioritized files keep deterministic ordering
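A minimal sketch of the suggested tie-breaker and unit test follows. `TestFile` here is a stand-in with only a `relative_path` field, and the `priority` function with its two hard-coded entries is hypothetical; the real runner's types and priority table may differ. It uses `sort_by` with `Ordering::then_with` rather than `sort_by_key` so that the path does not need to be cloned for every comparison.

```rust
use std::path::PathBuf;

const DEFAULT_PRIORITY: usize = 100;

/// Stand-in for the runner's `TestFile`; only the field used for ordering.
#[derive(Debug, Clone, PartialEq)]
struct TestFile {
    relative_path: PathBuf,
}

/// Hypothetical priority lookup; lower values run first.
fn priority(file: &TestFile) -> usize {
    match file.relative_path.to_str() {
        Some("push_down_filter_regression.slt") => 0,
        Some("cte.slt") => 1,
        _ => DEFAULT_PRIORITY,
    }
}

/// Orders by priority, then by relative path so equal priorities are stable.
fn sort_tests(mut tests: Vec<TestFile>) -> Vec<TestFile> {
    tests.sort_by(|a, b| {
        priority(a)
            .cmp(&priority(b))
            .then_with(|| a.relative_path.cmp(&b.relative_path))
    });
    tests
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn prioritized_first_then_by_path() {
        let file = |p: &str| TestFile { relative_path: PathBuf::from(p) };
        let sorted = sort_tests(vec![
            file("b.slt"),
            file("cte.slt"),
            file("a.slt"),
            file("push_down_filter_regression.slt"),
        ]);
        let names: Vec<_> = sorted
            .iter()
            .map(|f| f.relative_path.to_str().unwrap())
            .collect();
        assert_eq!(
            names,
            ["push_down_filter_regression.slt", "cte.slt", "a.slt", "b.slt"]
        );
    }
}
```

The test covers both requested properties: prioritized files come first, and non-prioritized files fall back to path order regardless of input order.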

kosiew · 2026-03-02T06:46:44Z

datafusion/sqllogictest/bin/sqllogictests.rs

+
+/// Default priority for tests not in the TEST_PRIORITY map. Tests with lower
+/// priority values run first.
+static DEFAULT_PRIORITY: usize = 100;


nit: can this be a const instead of a static?
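As a small sketch of what the nit amounts to: a `const` is substituted at each use site, while a `static` is a single memory location with a fixed address. For a plain `usize` that is only read, the two behave the same at the call site. The helper `priority_or_default` below is hypothetical and exists only to show a use.

```rust
/// Default priority for tests without an explicit entry; lower values run first.
const DEFAULT_PRIORITY: usize = 100;

/// Hypothetical helper: falls back to the default when no priority is known.
fn priority_or_default(known: Option<usize>) -> usize {
    known.unwrap_or(DEFAULT_PRIORITY)
}
```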

@alamb

…er than 2 (#20586)

Draft as it builds on #20576

## Which issue does this PR close?

- Part of #20524
- Follow on to #20576 from @alamb

## Rationale for this change

Execution time of the test is dominated by the time writing the parquet files. By reusing the file we can gain around 30% improvement on the execution time here.

## What changes are included in this PR?

Building on #20576 we reuse the needed parquet file for the test instead of recreating it.

## Are these changes tested?

Ran the test with following results:

| | Baseline (2 files) | Optimized (1 file) |
|---|---|---|
| Min | 33.000s | 22.653s |
| Max | 37.662s | 25.489s |
| Avg | 34.427s | 24.092s |

One open question: does the correctness of this regression test rely on having two **physically separate** files? The race condition in #17197 was in the execution layer; both scans would still be independent `DataSourceExec` nodes with independent readers, so I believe the behavior is preserved. But if there's any concern, we could use `system cp` to copy the file and register two physical files while still only paying the `generate_series` cost once.

## Are there any user-facing changes?

alamb · 2026-03-02T18:02:24Z

Whoops -- sorry @kosiew -- I will address your comments as a follow on PR

alamb · 2026-03-02T18:39:58Z

I am really struggling to write tests for this code as it is all in the bin/sqllogictests.rs file. I'll try a larger refactor.

github-actions bot added the optimizer (Optimizer rules) and sqllogictest (SQL Logic Tests (.slt)) labels Feb 26, 2026

alamb changed the title ~~Alamb/prioritize long tests~~ Speedup sqllogictests by running long running tests first Feb 26, 2026

This was referenced Feb 26, 2026

Split push_down_filter.slt into standalone sqllogictest files to reduce long-tail runtime #20566

Merged

Add deterministic per-file timing summary to sqllogictest runner #20569

Merged

Tim-53 reviewed Feb 26, 2026


Tim-53 mentioned this pull request Feb 26, 2026

Improve sqllogicteset speed by creating only a single large file rather than 2 #20586

Merged

alamb force-pushed the alamb/prioritize_long_tests branch from b2d2635 to db9d0b6 Compare February 27, 2026 12:00

github-actions bot removed the optimizer (Optimizer rules) label Feb 27, 2026

Run long running .slt tests first

7ae3e62

alamb force-pushed the alamb/prioritize_long_tests branch from db9d0b6 to 7ae3e62 Compare February 27, 2026 12:06

alamb marked this pull request as ready for review February 27, 2026 12:06

alamb added 2 commits February 28, 2026 05:54

fix typos

b6c5b36

Merge remote-tracking branch 'apache/main' into alamb/prioritize_long…

1c1db95

…_tests

alamb added the development-process (Related to development process of DataFusion) label Mar 1, 2026

kosiew approved these changes Mar 2, 2026


alamb added this pull request to the merge queue Mar 2, 2026

Merged via the queue into apache:main with commit 02dae77 Mar 2, 2026
30 checks passed

alamb mentioned this pull request Mar 2, 2026

Add tests for sqllogictest prioritization #20656

Open


Speedup sqllogictests by running long running tests first #20576

alamb merged 3 commits into apache:main from alamb:alamb/prioritize_long_tests

3 participants
