[Data] write_parquet doesn't respect min_rows_per_file if you specify partition columns #59485

@bveeramani

Description

What happened + What you expected to happen

If you specify partition columns, Ray Data can produce files with substantially fewer rows than the specified min_rows_per_file. This happens because min_rows_per_file sets the minimum number of rows per write task, not per file: if a single write task contains rows from multiple partitions, each partition's file can end up far below the minimum, even though the task as a whole wrote at least min_rows_per_file rows.
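
To make the failure mode concrete, here's the back-of-the-envelope arithmetic for the reproduction script below (a rough sketch; the actual per-file counts depend on how rows are distributed across partitions within each write task):

# Rows in the repro are spread roughly uniformly over
# 12 months x 5 years = 60 partition combinations, so a write task
# holding exactly min_rows_per_file rows fans out into ~60 files.
min_rows_per_file = 10_000_000
num_partitions = 12 * 5  # month x year combinations
rows_per_file = min_rows_per_file // num_partitions
print(f"{rows_per_file:,}")  # ~166,666 rows per file, far below the 10M minimum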

I don't know if there's an easy solution to this. Maybe we can emit a warning.
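
One possible workaround in the meantime (a sketch, untested; it reuses ds_read, YEARS, and partitioned_path from the reproduction script below, and issues one filtered write per partition, which is slow when there are many partitions):

# Write each partition separately so min_rows_per_file applies
# within a single partition rather than across a mixed write task.
for year in YEARS:
    for month in range(1, 13):
        subset = ds_read.filter(
            lambda row, y=year, m=month: row["year"] == y and row["month"] == m
        )
        subset.write_parquet(
            f"{partitioned_path}/month={month}/year={year}",
            min_rows_per_file=10_000_000,
        )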

Versions / Dependencies

bd412da

Reproduction script

import os
import ray
import numpy as np
import pandas as pd
import pyarrow.fs

storage_path = "./data"
os.makedirs(storage_path, exist_ok=True)

NUM_ROWS = 100_000_000
YEARS = [2020, 2021, 2022, 2023, 2024]

initial_path = f"{storage_path}/initial_dataset"
partitioned_path = f"{storage_path}/partitioned_dataset"

print(f"Generating dataset with {NUM_ROWS:,} rows...")


def add_columns(batch):
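    # Assign random month/year values so rows spread across
    # 12 * 5 = 60 partition combinations.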
    n = len(batch["id"])
    batch["month"] = np.random.randint(1, 13, size=n)
    batch["year"] = np.random.choice(YEARS, size=n)
    return batch


ds = ray.data.range(NUM_ROWS)
ds = ds.map_batches(add_columns, batch_format="pandas")

print(f"Writing initial dataset to {initial_path}...")
ds.write_parquet(initial_path)
print("Initial dataset written successfully.")

print(f"\nReading dataset from {initial_path}...")
ds_read = ray.data.read_parquet(initial_path)

print(f"Writing partitioned dataset to {partitioned_path}...")
print("Using partition_cols=['month', 'year'] and min_rows_per_file=10,000,000")
ds_read.write_parquet(
    partitioned_path, partition_cols=["month", "year"], min_rows_per_file=10_000_000
)
print("Partitioned dataset written successfully.")

print("\nValidating partitioned output...")
fs = pyarrow.fs.LocalFileSystem()
abs_path = os.path.abspath(partitioned_path)
files = fs.get_file_info(pyarrow.fs.FileSelector(abs_path, recursive=True))
parquet_files = [f.path for f in files if f.path.endswith(".parquet")]

print(f"Found {len(parquet_files)} parquet files")
sample_file = parquet_files[0]
print(f"Reading sample file: {sample_file}")
df = pd.read_parquet(sample_file)
print(f"Number of rows in sample file: {len(df):,}")

print("\nDone!")

Issue Severity

None

Metadata

Labels

P2 (Important issue, but not time-critical), bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues)
