Conversation

@J-Meyers
Contributor

@J-Meyers J-Meyers commented Oct 13, 2025

When the total number of elements across all lists in a column within a row group exceeds int32 max, DuckDB silently writes invalid Parquet files, potentially causing data loss.

This PR makes two changes: it turns the silent failure into a loud exception, and it implements a fix that allows writing files in this case. The fix, however, goes against an existing code comment whose reasoning I was unsure of. I couldn't find any corresponding requirement in the Parquet spec, so I'm assuming the comment is there for performance reasons, and I only go against it when absolutely required.

I added a really, really slow test which replicates the issue. On the current version of DuckDB it fails on the read, but not the write, with:

Invalid Error:
Out of buffer

However, after just replacing that cast with the checked version, it instead fails on the write; with the other changes it no longer fails at all. Switching to the checked cast seems like good practice anyway, in case there are similar issues elsewhere.
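To illustrate the difference, here is a minimal sketch of a checked narrowing cast in the spirit of the `NumericCast` discussed below, as opposed to an unchecked `static_cast`. The function name and error type are illustrative, not DuckDB's actual implementation, and for brevity it assumes both types are signed integers:

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical checked narrowing cast: throws on out-of-range values
// instead of silently truncating like static_cast / an unchecked cast.
// Assumes FROM and TO are both signed integer types.
template <class TO, class FROM>
TO CheckedNumericCast(FROM value) {
	if (value < static_cast<FROM>(std::numeric_limits<TO>::min()) ||
	    value > static_cast<FROM>(std::numeric_limits<TO>::max())) {
		throw std::overflow_error("value out of range for target type");
	}
	return static_cast<TO>(value);
}
```

With an int64 offset of 3,000,000,000 (as in the test below), `CheckedNumericCast<int32_t>` throws instead of silently wrapping to a negative value.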

It seems the test was too slow and was getting a SIGTERM (it may also have been running out of RAM, I'm not sure).

Here is the relevant test. It's difficult to run routinely because operations on large lists in DuckDB are generally quite slow, especially once they approach 2GB in size:

require parquet

statement ok
CREATE TABLE test_10m_arrays AS
SELECT list_resize([]::TINYINT[], 10000000, 1::TINYINT) AS large_array
FROM range(300);

statement ok
COPY (SELECT * FROM test_10m_arrays) TO '__TEST_DIR__/test_10m_arrays.parquet';

statement ok
DROP TABLE test_10m_arrays;

query I
SELECT large_array.len().avg() AS avg_length FROM '__TEST_DIR__/test_10m_arrays.parquet';
----
10000000.000000

@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 14, 2025 03:03
@J-Meyers J-Meyers marked this pull request as ready for review October 14, 2025 03:05
@Mytherin
Collaborator

Thanks for the PR!

This was added in #18578 which references #18512

I think the issue here is that we are doing this on the vector level, not at the record level which is the actual requirement in the spec. We should make sure records are aligned on page boundaries, but a vector contains multiple records (and can thus span multiple pages).

CC @lnkuiper

@Mytherin
Collaborator

Maybe this PR can be modified to just change the cast to a NumericCast<int32_t> and we can fix the actual issue in a later PR?

@lnkuiper
Collaborator

Indeed, Mark summarized it nicely. This will cause us not to follow the Parquet spec anymore, as lists shouldn't span multiple pages. This is considered an invalid Parquet file. DuckDB can read it without issues, but other readers may not, so we cannot make this change. If the cast is changed to a NumericCast instead of an UnsafeNumericCast, this will throw an internal exception instead of silently failing, which is already a lot better.

This should be fixed properly by writing at the record level, but our current write infrastructure is at the vector level, so it's very coarse-grained, and requires batches of 2048 records to fit on a page (instead of a single record).
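The contrast between the two granularities can be sketched as follows. All names and the page-size target are hypothetical, and this is not DuckDB's actual writer code:

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t PAGE_LIMIT = 1u << 20; // assumed 1 MiB page target

// Vector-level decision: the entire batch of records must fit on the
// current page together, so one very large list (or a large batch)
// forces an oversized page.
bool BatchFitsOnPage(const std::vector<std::size_t> &record_sizes,
                     std::size_t bytes_used) {
	std::size_t batch_bytes = 0;
	for (auto sz : record_sizes) {
		batch_bytes += sz;
	}
	return bytes_used + batch_bytes <= PAGE_LIMIT;
}

// Record-level decision: a new page may start before any record, so
// only a single record at a time must fit, matching the spec's actual
// requirement that records not span page boundaries.
std::size_t PagesNeededRecordLevel(const std::vector<std::size_t> &record_sizes) {
	std::size_t pages = 1, bytes_used = 0;
	for (auto sz : record_sizes) {
		if (bytes_used + sz > PAGE_LIMIT && bytes_used > 0) {
			pages++;
			bytes_used = 0;
		}
		bytes_used += sz;
	}
	return pages;
}
```

With a batch of 2048 records of 4 KiB each (8 MiB total), the batch-level check fails outright, while the record-level writer simply splits the batch across pages.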

This is a larger fix, outside of the scope of v1.4-andium, so we'd prefer to just change the cast so it's no longer a silent failure.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 14, 2025 14:09
@J-Meyers J-Meyers marked this pull request as ready for review October 14, 2025 14:09
@J-Meyers
Contributor Author

Made the requested changes; the macOS build failure is unrelated:

Successfully downloaded Fix-compatibility-in-package-version-file.patch
Downloading https://github.com/GNOME/libxml2/commit/fe1ee0f25f43e33a9981fd6fe7b0483a8c8b5e8d.diff?full_index=1 -> Add-missing-Bcrypt-link.patch
error: curl: (56) The requested URL returned error: 503
note: If you are using a proxy, please ensure your proxy settings are correct.
Possible causes are:
1. You are actually using an HTTP proxy, but setting HTTPS_PROXY variable to `https//address:port`.
This is not correct, because `https://` prefix claims the proxy is an HTTPS proxy, while your proxy (v2ray, shadowsocksr, etc...) is an HTTP proxy.
Try setting `http://address:port` to both HTTP_PROXY and HTTPS_PROXY instead.
2. If you are using Windows, vcpkg will automatically use your Windows IE Proxy Settings set by your proxy software. See: https://github.com/microsoft/vcpkg-tool/pull/77
The value set by your proxy might be wrong, or have same `https://` prefix issue.
3. Your proxy's remote server is our of service.
If you believe this is not a temporary download server failure and vcpkg needs to be changed to download this file from a different location, please submit an issue to https://github.com/Microsoft/vcpkg/issues
CMake Error at scripts/cmake/vcpkg_download_distfile.cmake:136 (message):
  Download failed, halting portfile.
Call Stack (most recent call first):
  buildtrees/versioning_/versions/libxml2/90c8aae598f04d95b887f5bfd29e24ab1308bbfa/portfile.cmake:7 (vcpkg_download_distfile)
  scripts/ports.cmake:206 (include)


error: building libxml2:arm64-osx-release failed with: BUILD_FAILED

@Mytherin Mytherin merged commit d13e53b into duckdb:v1.4-andium Oct 14, 2025
93 of 94 checks passed
@Mytherin
Collaborator

Thanks!

@J-Meyers J-Meyers deleted the nested_column_spanning_pages branch October 14, 2025 22:20
@J-Meyers J-Meyers restored the nested_column_spanning_pages branch October 14, 2025 22:20
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Oct 21, 2025
Fixes for CTE (de)serialization compatibility with older versions (duckdb/duckdb#19393)
BUGFIX: Silent failure to write row groups with large lists (duckdb/duckdb#19376)
Throw if non-`VARCHAR` key is passed to `json_object` (duckdb/duckdb#19365)
add test tag support [vfs integration tests p1] (duckdb/duckdb#19331)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Oct 21, 2025

Co-authored-by: krlmlr <[email protected]>