Open
Labels
P2: Important issue, but not time-critical
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
Description
What happened + What you expected to happen
If you specify partition columns, Ray Data might produce files with substantially fewer rows than the specified min_rows_per_file. This happens because min_rows_per_file determines the minimum number of rows per write task, not per file. If a write task contains multiple partitions, its rows are split across one file per partition, so each file can receive far fewer rows than the minimum even though the task as a whole wrote more than the minimum number of rows.
I don't know if there's an easy solution to this. Maybe we can emit a warning.
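To make the arithmetic concrete: in the reproduction below, a write task holding about 10,000,000 rows spreads them over 12 * 5 = 60 (month, year) partitions, so each partition file receives only ~167,000 rows. Here is a minimal sketch of that per-task split (plain NumPy/pandas; this models the row accounting, not Ray's actual write path):

import numpy as np
import pandas as pd

# Hypothetical contents of a single write task: 10,000,000 rows,
# matching min_rows_per_file in the reproduction script below.
task_rows = 10_000_000
block = pd.DataFrame({
    "month": np.random.randint(1, 13, size=task_rows),
    "year": np.random.choice([2020, 2021, 2022, 2023, 2024], size=task_rows),
})

# With partition_cols=["month", "year"], the task writes one file per
# distinct (month, year) pair it holds, up to 60 files here.
rows_per_file = block.groupby(["month", "year"]).size()
print(f"mean rows per file: {rows_per_file.mean():,.0f}")  # ~166,667, not 10,000,000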
Versions / Dependencies
Reproduction script
import os

import numpy as np
import pandas as pd
import pyarrow.fs

import ray

storage_path = "./data"
os.makedirs(storage_path, exist_ok=True)

NUM_ROWS = 100_000_000
YEARS = [2020, 2021, 2022, 2023, 2024]

initial_path = f"{storage_path}/initial_dataset"
partitioned_path = f"{storage_path}/partitioned_dataset"

print(f"Generating dataset with {NUM_ROWS:,} rows...")

def add_columns(batch):
    # Give every row a random (month, year) pair, so the dataset spans
    # 12 * 5 = 60 distinct partitions.
    n = len(batch["id"])
    batch["month"] = np.random.randint(1, 13, size=n)
    batch["year"] = np.random.choice(YEARS, size=n)
    return batch

ds = ray.data.range(NUM_ROWS)
ds = ds.map_batches(add_columns, batch_format="pandas")

print(f"Writing initial dataset to {initial_path}...")
ds.write_parquet(initial_path)
print("Initial dataset written successfully.")

print(f"\nReading dataset from {initial_path}...")
ds_read = ray.data.read_parquet(initial_path)

print(f"Writing partitioned dataset to {partitioned_path}...")
print("Using partition_cols=['month', 'year'] and min_rows_per_file=10,000,000")
ds_read.write_parquet(
    partitioned_path, partition_cols=["month", "year"], min_rows_per_file=10_000_000
)
print("Partitioned dataset written successfully.")

print("\nValidating partitioned output...")
fs = pyarrow.fs.LocalFileSystem()
abs_path = os.path.abspath(partitioned_path)
files = fs.get_file_info(pyarrow.fs.FileSelector(abs_path, recursive=True))
parquet_files = [f.path for f in files if f.path.endswith(".parquet")]
print(f"Found {len(parquet_files)} parquet files")

# Expect at least 10,000,000 rows per file; the sample file holds far fewer.
sample_file = parquet_files[0]
print(f"Reading sample file: {sample_file}")
df = pd.read_parquet(sample_file)
print(f"Number of rows in sample file: {len(df):,}")

print("\nDone!")
Issue Severity
None