[Data] write_parquet doesn't respect min_rows_per_file if you specify partition columns #59485

@bveeramani

Description

What happened + What you expected to happen

If you specify partition columns, Ray Data can produce files with substantially fewer rows than the specified min_rows_per_file. This happens because min_rows_per_file sets the minimum number of rows per write task, not per file: if a single write task contains rows from multiple partitions, each partition's file can end up far below the minimum, even though the task as a whole wrote at least min_rows_per_file rows.
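
To make the failure mode concrete, here's the back-of-the-envelope arithmetic for the reproduction script below (a rough sketch; the actual per-file counts depend on how rows are distributed across partitions within each write task):

# Rows in the repro are spread roughly uniformly over
# 12 months x 5 years = 60 partition combinations, so a write task
# holding exactly min_rows_per_file rows fans out into ~60 files.
min_rows_per_file = 10_000_000
num_partitions = 12 * 5  # month x year combinations
rows_per_file = min_rows_per_file // num_partitions
print(f"{rows_per_file:,}")  # ~166,666 rows per file, far below the 10M minimum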

I don't know if there's an easy solution to this. Maybe we can emit a warning.
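
One possible workaround in the meantime (a sketch, untested; it reuses ds_read, YEARS, and partitioned_path from the reproduction script below, and issues one filtered write per partition, which is slow when there are many partitions):

# Write each partition separately so min_rows_per_file applies
# within a single partition rather than across a mixed write task.
for year in YEARS:
    for month in range(1, 13):
        subset = ds_read.filter(
            lambda row, y=year, m=month: row["year"] == y and row["month"] == m
        )
        subset.write_parquet(
            f"{partitioned_path}/month={month}/year={year}",
            min_rows_per_file=10_000_000,
        )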

Versions / Dependencies

bd412da

Reproduction script

import os
import ray
import numpy as np
import pandas as pd
import pyarrow.fs

storage_path = "./data"
os.makedirs(storage_path, exist_ok=True)

NUM_ROWS = 100_000_000
YEARS = [2020, 2021, 2022, 2023, 2024]

initial_path = f"{storage_path}/initial_dataset"
partitioned_path = f"{storage_path}/partitioned_dataset"

print(f"Generating dataset with {NUM_ROWS:,} rows...")


def add_columns(batch):
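    # Assign random month/year values so rows spread across
    # 12 * 5 = 60 partition combinations.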
    n = len(batch["id"])
    batch["month"] = np.random.randint(1, 13, size=n)
    batch["year"] = np.random.choice(YEARS, size=n)
    return batch


ds = ray.data.range(NUM_ROWS)
ds = ds.map_batches(add_columns, batch_format="pandas")

print(f"Writing initial dataset to {initial_path}...")
ds.write_parquet(initial_path)
print("Initial dataset written successfully.")

print(f"\nReading dataset from {initial_path}...")
ds_read = ray.data.read_parquet(initial_path)

print(f"Writing partitioned dataset to {partitioned_path}...")
print("Using partition_cols=['month', 'year'] and min_rows_per_file=10,000,000")
ds_read.write_parquet(
    partitioned_path, partition_cols=["month", "year"], min_rows_per_file=10_000_000
)
print("Partitioned dataset written successfully.")

print("\nValidating partitioned output...")
fs = pyarrow.fs.LocalFileSystem()
abs_path = os.path.abspath(partitioned_path)
files = fs.get_file_info(pyarrow.fs.FileSelector(abs_path, recursive=True))
parquet_files = [f.path for f in files if f.path.endswith(".parquet")]

print(f"Found {len(parquet_files)} parquet files")
sample_file = parquet_files[0]
print(f"Reading sample file: {sample_file}")
df = pd.read_parquet(sample_file)
print(f"Number of rows in sample file: {len(df):,}")

print("\nDone!")

Issue Severity

None

Metadata

Labels

P2 (Important issue, but not time-critical), bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues)
