Speedup sqllogictests by running long running tests first #20576

Merged
alamb merged 3 commits into apache:main from alamb:alamb/prioritize_long_tests
Mar 2, 2026
@alamb
Contributor

@alamb alamb commented Feb 26, 2026

Which issue does this PR close?

Rationale for this change

Our sqllogictests harness runs the queries in a single file serially, but runs multiple files in parallel.

Right now, the runtime of

cargo test --profile=ci --test sqllogictests

is dominated by a few long-running tests, so the sooner they are started, the sooner the overall suite finishes.

What changes are included in this PR?

Building on #20566 from @kosiew, this PR adds a heuristic reordering of the tests so that the longest-running ones are run first.
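To make the scheduling argument concrete, here is a minimal sketch (not the harness code) that simulates handing jobs, in a given order, to whichever worker frees up first. The function name `makespan` and the durations used with it are hypothetical and chosen only for illustration.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Simulates greedy scheduling: each job, taken in the given order, goes to
/// the worker that becomes free first. Returns the total wall-clock time.
/// (Hypothetical helper for illustration; not part of the harness.)
fn makespan(durations: &[u64], workers: usize) -> u64 {
    assert!(workers > 0, "need at least one worker");
    // Min-heap holding the time at which each worker next becomes free.
    let mut free_at: BinaryHeap<Reverse<u64>> =
        (0..workers).map(|_| Reverse(0)).collect();
    let mut finish: u64 = 0;
    for &d in durations {
        let Reverse(start) = free_at.pop().expect("one entry per worker");
        let end = start + d;
        finish = finish.max(end);
        free_at.push(Reverse(end));
    }
    finish
}
```

With two workers and made-up durations `[1, 1, 1, 1, 5]`, the long job starts last and the simulated run takes 7 units; reordered as `[5, 1, 1, 1, 1]` it takes 5, which is the duration of the longest job.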

Are these changes tested?

By CI, and I ran performance tests manually.

on main

On main this takes 8 seconds

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo test --profile=ci --test sqllogictests
    Finished `ci` profile [unoptimized] target(s) in 0.21s
     Running bin/sqllogictests.rs (target/ci/deps/sqllogictests-c4e4be8d5c9fd66e)
Running with 16 test threads (available parallelism: 16)
Completed 408 test files in 8 seconds

After test split

After #20566 it takes 7 seconds to complete:

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo test --profile=ci --test sqllogictests
    Finished `ci` profile [unoptimized] target(s) in 0.20s
     Running bin/sqllogictests.rs (target/ci/deps/sqllogictests-c4e4be8d5c9fd66e)
Running with 16 test threads (available parallelism: 16)
Completed 411 test files in 7 seconds

This PR

With this PR it takes 5 seconds:

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo test --profile=ci --test sqllogictests
   Compiling datafusion-sqllogictest v52.1.0 (/Users/andrewlamb/Software/datafusion/datafusion/sqllogictest)
    Finished `ci` profile [unoptimized] target(s) in 1.92s
     Running bin/sqllogictests.rs (target/ci/deps/sqllogictests-c4e4be8d5c9fd66e)
Running with 16 test threads (available parallelism: 16)
Completed 411 test files in 5 seconds

This is actually bounded by the time it takes to run the longest test, push_down_filter_regression.slt:

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo test --profile=ci --test sqllogictests -- push_down_filter_regression.slt
    Finished `ci` profile [unoptimized] target(s) in 0.20s
     Running bin/sqllogictests.rs (target/ci/deps/sqllogictests-c4e4be8d5c9fd66e)
Running with 16 test threads (available parallelism: 16)
Completed 1 test files in 5 seconds

Are there any user-facing changes?

@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Feb 26, 2026
@alamb alamb changed the title from "Alamb/prioritize long tests" to "Speedup sqllogictests by running long running tests first" Feb 26, 2026
# Test push down filter

# Regression test for https://github.com/apache/datafusion/issues/17188
query I
Contributor

I experimented with reducing the file generation cost here. Instead of creating two separate parquet files, we can create only t2.parquet and register t1 as a projection of it with just the k column:

CREATE EXTERNAL TABLE t1 (k INT)
STORED AS PARQUET LOCATION '.../t2.parquet';

This gave a noticeable speedup locally:

| | Baseline (2 files) | Optimized (1 file) |
|---|---|---|
| Min | 33.000s | 22.653s |
| Max | 37.662s | 25.489s |
| Avg | 34.427s | 24.092s |

One open question: does the correctness of this regression test rely on having two physically separate files? The race condition in #17197 was in the execution layer — both scans would still be independent DataSourceExec nodes with independent readers, so I believe the behavior is preserved. But if there's any concern, we could use system cp to copy the file and register two physical files while still only paying the generate_series cost once.

Contributor

Just filed a follow-up PR for that #20586.

Contributor Author

> both scans would still be independent DataSourceExec nodes with independent readers, so I believe the behavior is preserved. But if there's any concern, we could use system cp to copy the file and register two physical files while still only paying the generate_series cost once.

Thanks @Tim-53 -- that is a great idea.

@alamb alamb force-pushed the alamb/prioritize_long_tests branch from db9d0b6 to 7ae3e62 Compare February 27, 2026 12:06
@alamb alamb marked this pull request as ready for review February 27, 2026 12:06
@alamb alamb added the development-process Related to development process of DataFusion label Mar 1, 2026
/// 9. 0.973s datetime/timestamps.slt
/// 10. 0.822s cte.slt
/// ```
static TEST_PRIORITY: LazyLock<HashMap<PathBuf, usize>> = LazyLock::new(|| {
Contributor

Would a simpler static slice be sufficient here (eg &[(&str, usize)] with a small helper) instead of LazyLock<HashMap<PathBuf, usize>>?

Contributor Author

I think a slice would be fine too, though then the lookup would potentially be slower (as it would have to resort each) -- I'll leave it as a HashMap unless you feel strongly.
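For comparison, the slice-based alternative raised above could look roughly like the sketch below. The helper name `priority_of` and the priority values in the table are assumptions for illustration only; the file names are the ones mentioned in this PR. Each lookup is a linear scan, which is the trade-off against the `HashMap`.

```rust
use std::path::Path;

/// Priority used for files that are not listed in the table below.
const DEFAULT_PRIORITY: usize = 100;

/// Hypothetical priority table: (relative path, priority). Lower runs first.
/// The numeric values are illustrative, not the ones used in the PR.
const TEST_PRIORITY: &[(&str, usize)] = &[
    ("push_down_filter_regression.slt", 0),
    ("datetime/timestamps.slt", 8),
    ("cte.slt", 9),
];

/// Looks up a file's priority with a linear scan over the slice.
/// (Hypothetical helper; the name is not taken from the PR.)
fn priority_of(relative_path: &Path) -> usize {
    TEST_PRIORITY
        .iter()
        .find(|entry| Path::new(entry.0) == relative_path)
        .map(|entry| entry.1)
        .unwrap_or(DEFAULT_PRIORITY)
}
```

With only a handful of entries the scan is cheap, so the choice between the two is mostly a matter of taste.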

/// Prioritizes tests to run earlier if they are known to be long running (as
/// each test file itself is run sequentially, but multiple test files are run
/// in parallel).
fn sort_tests(mut tests: Vec<TestFile>) -> Vec<TestFile> {
Contributor

Can we add a deterministic tie-breaker in sort_tests (for equal priority) using relative_path, e.g. sort_by_key(|f| (priority, f.relative_path.clone())) to keep run order stable?

This would also benefit from a small unit test covering:

  • prioritized files are first,
  • non-prioritized files keep deterministic ordering
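A minimal sketch of the suggested tie-breaker follows. It assumes a simplified stand-in `TestFile` with only a `relative_path` field and a hypothetical `priority_of` helper that knows a single long-running file; neither is the harness's actual definition. It uses `sort_by` with `then_with` rather than `sort_by_key`, which gives the same ordering without cloning the path.

```rust
use std::path::PathBuf;

/// Simplified stand-in for the harness's `TestFile` (assumption).
#[derive(Debug, Clone, PartialEq)]
struct TestFile {
    relative_path: PathBuf,
}

const DEFAULT_PRIORITY: usize = 100;

/// Hypothetical priority lookup: only one file is treated as long running.
fn priority_of(file: &TestFile) -> usize {
    match file.relative_path.to_str() {
        Some("push_down_filter_regression.slt") => 0,
        _ => DEFAULT_PRIORITY,
    }
}

/// Sorts by priority (lower first), breaking ties by relative path so the
/// run order does not depend on the order in which files were discovered.
fn sort_tests(mut tests: Vec<TestFile>) -> Vec<TestFile> {
    tests.sort_by(|a, b| {
        priority_of(a)
            .cmp(&priority_of(b))
            .then_with(|| a.relative_path.cmp(&b.relative_path))
    });
    tests
}
```

A unit test along the lines requested could feed in `b.slt`, `push_down_filter_regression.slt`, and `a.slt`, then check that the prioritized file comes first and the other two follow in path order.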


/// Default priority for tests not in the TEST_PRIORITY map. Tests with lower
/// priority values run first.
static DEFAULT_PRIORITY: usize = 100;
Contributor

nit: can this be a const instead of a static?
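For reference, the `const` form would look like the sketch below. The helper `priority_or_default` is hypothetical and only shows a use site. A `const` is inlined wherever it is used, whereas a `static` has a single fixed memory location, which a plain `usize` default does not need.

```rust
/// Default priority for tests not in the priority map. Tests with lower
/// priority values run first.
const DEFAULT_PRIORITY: usize = 100;

/// Hypothetical use site: fall back to the default when no priority is known.
fn priority_or_default(known: Option<usize>) -> usize {
    known.unwrap_or(DEFAULT_PRIORITY)
}
```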

github-merge-queue bot pushed a commit that referenced this pull request Mar 2, 2026
…er than 2 (#20586)

Draft as it builds on #20576


## Which issue does this PR close?
- Part of #20524
- Follow on to #20576 from @alamb

## Rationale for this change
Execution time of the test is dominated by the time writing the parquet
files. By reusing the file we can gain around 30% improvement on the
execution time here.

## What changes are included in this PR?

Building on #20576 we reuse the needed parquet file for the test instead
of recreating it.

## Are these changes tested?
Ran the test with following results:

| | Baseline (2 files) | Optimized (1 file) |
|---|---|---|
| Min | 33.000s | 22.653s |
| Max | 37.662s | 25.489s |
| Avg | 34.427s | 24.092s |

One open question: does the correctness of this regression test rely on
having two **physically separate** files? The race condition in #17197
was in the execution layer — both scans would still be independent
`DataSourceExec` nodes with independent readers, so I believe the
behavior is preserved. But if there's any concern, we could use `system
cp` to copy the file and register two physical files while still only
paying the `generate_series` cost once.

## Are there any user-facing changes?
@alamb alamb added this pull request to the merge queue Mar 2, 2026
Merged via the queue into apache:main with commit 02dae77 Mar 2, 2026
30 checks passed
@alamb
Contributor Author

alamb commented Mar 2, 2026

Whoops -- sorry @kosiew -- I will address your comments as a follow on PR

@alamb
Contributor Author

alamb commented Mar 2, 2026

I am really struggling to write tests for this code as it is all in the bin/sqllogictest.rs file. I'll try a larger refactor

Labels

development-process Related to development process of DataFusion sqllogictest SQL Logic Tests (.slt)

3 participants