Improve sqllogictest speed by creating only a single large file rather than 2 #20586
Thank you 🙏 I left a note on Asking the original authors if they could double check
`// trigger ci test`
Can this be removed? Also, in case it's helpful, an empty commit can trigger CI: `git commit -m "ci" --allow-empty --no-verify`
(I think this is left over from #20566 -- when this PR gets rebased it should be removed)
adriangb left a comment
I don't think the test needs two physically distinct files. As long as it's two different execution nodes that should be good enough!
alamb left a comment
I took the liberty of rebasing this PR against main.
I think it looks good to me
Thanks again @Tim-53
Draft as it builds on #20576
Which issue does this PR close?
Rationale for this change
The execution time of the test is dominated by writing the parquet files. By reusing a single file we gain around a 30% improvement in execution time here.
What changes are included in this PR?
Building on #20576, we reuse the parquet file the test needs instead of recreating it.
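As a rough illustration of the reuse idea (not the actual sqllogictest change), the Python sketch below pays the generation cost once and then reads the result through two independent readers. `generate_data_file` and `scan` are hypothetical stand-ins, and a plain text file stands in for the parquet file. It also shows the fallback of copying the file to get a second physical file without regenerating the data.

```python
import os
import shutil
import tempfile


def generate_data_file(path, rows):
    """Hypothetical stand-in for the expensive step (writing the data file)."""
    with open(path, "w") as f:
        for i in range(rows):
            f.write(f"{i}\n")


def scan(path):
    """Each call opens its own reader, loosely analogous to a separate scan node."""
    with open(path) as f:
        return sum(int(line) for line in f)


workdir = tempfile.mkdtemp()
shared = os.path.join(workdir, "data.txt")

# Pay the generation cost exactly once.
generate_data_file(shared, 1000)

# Option 1: two independent scans over the same physical file.
total_a = scan(shared)
total_b = scan(shared)

# Option 2: copy the file to get a second physical file, still without regenerating.
copy_path = os.path.join(workdir, "data_copy.txt")
shutil.copy(shared, copy_path)
total_c = scan(copy_path)
```

Either option avoids a second generation pass. The copy only matters if the test genuinely needs two distinct files on disk.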
Are these changes tested?
Ran the test with the following results:
One open question: does the correctness of this regression test rely on having two physically separate files? The race condition in #17197 was in the execution layer. Both scans would still be independent `DataSourceExec` nodes with independent readers, so I believe the behavior is preserved. But if there's any concern, we could use `system cp` to copy the file and register two physical files while still only paying the `generate_series` cost once.
Are there any user-facing changes?