Can't load large list columns from parquet files #18137

@nebfield

What happens?

I'm working with Parquet files that contain a column of large fixed-size lists of tiny integers (uint8).

I want to process the list columns as arrays or matrices in polars/numpy, but also sometimes use DuckDB's list_* functions.

The problem: loading about 6 GB of Parquet files stages ~90 GB of temporary data to disk, and then the DuckDB process crashes on a machine with 32 GB of RAM.

I'm not sure whether I should adopt a different approach or whether I'm trying to do something very odd. Any thoughts would be appreciated.

To Reproduce

Data generation script

# /// script
# dependencies = [
#   "polars",
#   "numpy"
# ]
# ///

import numpy as np
import polars as pl

n_samples = 500_000
n_variants = 1_000
d = {}
d.setdefault("gts", [])

for i in range(n_variants):
    d["gts"].append(
        pl.Series(np.random.randint(0, 2, size=(n_samples, 2), dtype=np.uint8))
    )


for i in range(20):
    (
        pl.DataFrame(d)
        .with_columns(
            pl.col("gts").cast(
                pl.Array(pl.Array(pl.UInt8, shape=(2,)), shape=(n_samples,))
            )
        )
        .with_row_index()
        .with_columns(pl.lit(f"batch{i}").alias("batch"))
        .write_parquet("data/", partition_by=["batch"])
    )

$ uv run script.py
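
For a sense of scale, here is a rough back-of-the-envelope estimate of the raw data volume implied by the script's parameters. It is only a sketch: it assumes one byte per uint8 value and ignores any list offsets, validity masks, or engine-specific overhead, so it says nothing about how DuckDB actually lays the data out in memory.

```python
# Parameters copied from the data generation script above.
n_samples = 500_000      # outer array length per row
n_variants = 1_000       # rows per batch
n_batches = 20           # partitions written by the loop
values_per_sample = 2    # inner array shape=(2,)
bytes_per_value = 1      # uint8 (assumption: no per-value overhead)

bytes_per_row = n_samples * values_per_sample * bytes_per_value
bytes_per_batch = bytes_per_row * n_variants
total_bytes = bytes_per_batch * n_batches
total_rows = n_variants * n_batches

print(f"per row:   {bytes_per_row / 1e6:.0f} MB")
print(f"per batch: {bytes_per_batch / 1e9:.0f} GB")
print(f"total:     {total_bytes / 1e9:.0f} GB across {total_rows:,} rows")
```

Under these assumptions each row holds about 1 MB of values, each batch about 1 GB, and the full dataset about 20 GB across 20,000 rows, so the ~90 GB of temporary data is several times larger than the raw values themselves.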

Failing SQL script

CREATE VIEW parquet_view AS SELECT * FROM read_parquet('data/**/*.parquet', hive_partitioning = true);

SELECT count(batch) AS n_rows FROM parquet_view;

CREATE OR REPLACE TABLE sample_ids AS
SELECT index
FROM parquet_view;

SELECT count(index) FROM sample_ids;

-- start breaking

CREATE OR REPLACE TABLE explode_now AS
SELECT *
FROM parquet_view;

SELECT count(gts) FROM explode_now;

OS:

macOS arm64 15.5 (24F74)

DuckDB Version:

v1.3.1 (Ossivalis) 2063dda

DuckDB Client:

CLI

Hardware:

No response

Full Name:

Benjamin Wingfield

Affiliation:

EMBL-EBI

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
