Can't load large list columns from parquet files

### What happens?

I'm working with parquet files that contain a column of large fixed-size lists of tiny integers.

I want to process the list columns as arrays or matrices in polars/numpy, but also sometimes use DuckDB `list_*` functions.

My problem is that querying 6GB of parquet files causes ~90GB of temporary data to be staged to disk before the DuckDB process crashes on a machine with 32GB RAM.

I'm not sure if I should be adopting a different approach or if I'm trying to do something very odd. Any thoughts would be appreciated. 

### To Reproduce

#### Data generation script

```python
# /// script
# dependencies = [
#   "polars",
#   "numpy"
# ]
# ///

import numpy as np
import polars as pl

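n_samples = 500_000  # inner length-2 lists per row
n_variants = 1_000  # rows per batch

# Each row holds n_samples pairs of uint8 values drawn from {0, 1}.
d = {"gts": []}
for _ in range(n_variants):
    d["gts"].append(
        pl.Series(np.random.randint(0, 2, size=(n_samples, 2), dtype=np.uint8))
    )


# Write the same rows 20 times, once per hive partition (batch0 ... batch19).
for i in range(20):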
    (
        pl.DataFrame(d)
        .with_columns(
            pl.col("gts").cast(
                pl.Array(pl.Array(pl.UInt8, shape=(2,)), shape=(n_samples,))
            )
        )
        .with_row_index()
        .with_columns(pl.lit(f"batch{i}").alias("batch"))
        .write_parquet("data/", partition_by=["batch"])
    )
```

```
$ uv run script.py
```
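
For a sense of scale, here is a rough back-of-the-envelope sketch of the raw `gts` payload the script above writes. It uses only the parameters from that script and the sizes reported above. It assumes decimal gigabytes and one byte per `uint8` value, and it says nothing about how DuckDB represents the nested lists internally.

```python
# Parameters copied from the data generation script.
n_samples = 500_000   # inner lists per row
values_per_sample = 2  # uint8 values per inner list
n_variants = 1_000    # rows per batch
n_batches = 20        # partitions written by the second loop
bytes_per_value = 1   # np.uint8

# Raw (uncompressed, overhead-free) size of the gts values.
row_bytes = n_samples * values_per_sample * bytes_per_value
batch_bytes = row_bytes * n_variants
total_bytes = batch_bytes * n_batches

GB = 10**9  # assumption: decimal gigabytes
reported_parquet_gb = 6   # approximate figure from the report
reported_temp_gb = 90     # approximate figure from the report

raw_gb = total_bytes / GB
print(f"one row of gts: {row_bytes / 10**6:.0f} MB")
print(f"one batch:      {batch_bytes / GB:.0f} GB")
print(f"all batches:    {raw_gb:.0f} GB")
print(f"temp data / raw payload:  {reported_temp_gb / raw_gb:.1f}x")
print(f"temp data / parquet size: {reported_temp_gb / reported_parquet_gb:.1f}x")
```

Under those assumptions each row carries about 1 MB of `gts` values and the 20 batches hold about 20 GB uncompressed, so the ~90GB staged to disk is roughly 4.5 times the raw payload and 15 times the parquet size.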

#### Failing SQL script

```sql
CREATE VIEW parquet_view AS SELECT * FROM read_parquet('data/**/*.parquet', hive_partitioning = true);

SELECT count(batch) AS n_rows FROM parquet_view;

CREATE OR REPLACE TABLE sample_ids AS
SELECT index
FROM parquet_view;

SELECT count(index) FROM sample_ids;

-- the failure starts from here

CREATE OR REPLACE TABLE explode_now AS
SELECT *
FROM parquet_view;

SELECT count(gts) FROM explode_now;
```

### OS:

macOS arm64 15.5 (24F74)

### DuckDB Version:

v1.3.1 (Ossivalis) 2063dda3e6

### DuckDB Client:

CLI

### Hardware:

_No response_

### Full Name:

Benjamin Wingfield

### Affiliation:

EMBL-EBI

### What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

### Did you include all relevant data sets for reproducing the issue?

Yes

### Did you include all code required to reproduce the issue?

- [x] Yes, I have

### Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

- [x] Yes, I have

Can't load large list columns from parquet files #18137
