What happens?
My use case is that I'm working with parquet files that contain a column of large fixed-size lists of tiny integers.
I want to process the list columns as arrays or matrices in polars/numpy, but also sometimes use duckdb list_* functions.
My problem is that 6 GB of parquet files causes ~90 GB of temporary data to be staged to disk before the duckdb process crashes on a machine with 32 GB of RAM.
I'm not sure if I should be adopting a different approach or if I'm trying to do something very odd. Any thoughts would be appreciated.
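For scale, a rough back-of-the-envelope based on the generation script below: each row holds a 500,000 × 2 array of UInt8, so the logical (uncompressed) data is only about 20 GB, yet DuckDB stages roughly 4.5× that to disk.

```python
# Back-of-the-envelope: logical (uncompressed) size of the generated data.
n_samples = 500_000   # inner array length per row
n_variants = 1_000    # rows per batch
n_batches = 20

bytes_per_row = n_samples * 2 * 1  # 2 UInt8 values per sample, 1 byte each
total_bytes = bytes_per_row * n_variants * n_batches
print(f"{total_bytes / 1e9:.0f} GB logical")  # -> 20 GB
```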
To Reproduce
Data generation script
```python
# /// script
# dependencies = [
#     "polars",
#     "numpy"
# ]
# ///
import numpy as np
import polars as pl

n_samples = 500_000
n_variants = 1_000

d = {}
d.setdefault("gts", [])
for i in range(n_variants):
    d["gts"].append(
        pl.Series(np.random.randint(0, 2, size=(n_samples, 2), dtype=np.uint8))
    )

for i in range(20):
    (
        pl.DataFrame(d)
        .with_columns(
            pl.col("gts").cast(
                pl.Array(pl.Array(pl.UInt8, shape=(2,)), shape=(n_samples,))
            )
        )
        .with_row_index()
        .with_columns(pl.lit(f"batch{i}").alias("batch"))
        .write_parquet("data/", partition_by=["batch"])
    )
```

```shell
$ uv run script.py
```
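Before running the failing script, I sanity-check the schema DuckDB infers for the nested column (illustrative, not part of the failure itself), using the duckdb Python client:

```python
# Sanity check: inspect the schema DuckDB infers for the nested
# FIXED_SIZE_LIST column before querying it.
import duckdb

con = duckdb.connect()
print(con.sql(
    "DESCRIBE SELECT * "
    "FROM read_parquet('data/**/*.parquet', hive_partitioning = true)"
))
```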
Failing SQL script
```sql
CREATE VIEW parquet_view AS SELECT * FROM read_parquet("data/**/*.parquet", hive_partitioning = true);
SELECT count(batch) AS n_rows FROM parquet_view;

CREATE OR REPLACE TABLE sample_ids AS
SELECT index
FROM parquet_view;
SELECT count(index) FROM sample_ids;

-- start breaking
CREATE OR REPLACE TABLE explode_now AS
SELECT *
FROM parquet_view;
SELECT count(gts) FROM explode_now;
```
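For reference, these are the settings I plan to experiment with as a possible mitigation; this is untested against the reproduction above, so treat it as a sketch rather than a fix:

```python
# Hedged workaround sketch: these DuckDB settings exist, but whether they
# avoid the ~90 GB temp-file blowup here is untested.
import duckdb

con = duckdb.connect("repro.db")
con.execute("SET memory_limit = '24GB'")             # stay under the 32 GB of RAM
con.execute("SET preserve_insertion_order = false")  # allows more streaming execution
con.execute("SET temp_directory = '/tmp/duckdb_spill'")
con.execute("""
    CREATE OR REPLACE TABLE explode_now AS
    SELECT * FROM read_parquet('data/**/*.parquet', hive_partitioning = true)
""")
```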
OS:
macOS arm64 15.5 (24F74)
DuckDB Version:
v1.3.1 (Ossivalis) 2063dda
DuckDB Client:
CLI
Hardware:
No response
Full Name:
Benjamin Wingfield
Affiliation:
EMBL-EBI
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- Yes, I have