Combine parallel column encoding (ArrowRowGroupWriterFactory) with async writes (AsyncArrowWriter) #9499

@Shailesh-Kumar-Singh

Which part is this question about

Library API, specifically the interaction between ArrowRowGroupWriterFactory / ArrowColumnChunk (sync, parallel encoding) and AsyncArrowWriter (async, sequential encoding).

Describe your question
We're building a high-throughput streaming k-way merge for sorted Parquet files. The write pipeline looks like:
read (rayon decode + channel prefetch) → merge sort → parallel encode (rayon) → write to disk
We want both parallel column encoding and async disk writes. Currently the API only allows picking one.
Path A: Parallel encode, sync write

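let col_writers = rg_writer_factory.create_column_writers(rg_index)?;
// leaves_and_writers (construction elided): each leaf column paired with its writer from col_writers.
// `pool` is the shared rayon::ThreadPool; install() runs the parallel iterator on that pool.
let chunks: Vec<ArrowColumnChunk> = pool.install(|| {
    leaves_and_writers
        .into_par_iter()
        .map(|(leaf, mut col_writer)| {
            col_writer.write(&leaf)?;
            col_writer.close()
        })
        .collect::<Result<Vec<_>, ParquetError>>()
})?;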

// append_to_row_group requires sync SerializedFileWriter
let mut rg = writer.next_row_group()?;
for chunk in chunks {
    chunk.append_to_row_group(&mut rg)?;
}
rg.close()?;

Path B: Async write, sequential encode

let mut writer = AsyncArrowWriter::try_new(file, schema, Some(props))?;
writer.write(&batch).await?;
writer.close().await?;

The gap: ArrowColumnChunk (the output of parallel encoding) can only be appended through sync SerializedFileWriter. There's no async equivalent.

Question:
Is there a way to combine parallel encoding with async writes?

Additional context
Both read (decode) and write (encode) use a shared rayon pool for parallelism. The only sync bottleneck is the actual disk write inside append_to_row_group.
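One direction that might sidestep the bottleneck without any new parquet API is to hide the disk behind a `Write` adapter, so that append_to_row_group only copies bytes into a bounded queue while a background thread does the blocking writes. The sketch below uses only the standard library. `ChannelWriter` and `spawn_background_writer` are hypothetical names made up for illustration and are not part of the parquet crate; it is also an assumption that SerializedFileWriter accepts any `Write + Send` sink, which is worth checking against the crate version in use.

```rust
use std::io::{self, Write};
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread::{self, JoinHandle};

/// Hypothetical adapter: a `Write` sink that forwards bytes to a background thread.
struct ChannelWriter {
    tx: SyncSender<Vec<u8>>,
}

impl Write for ChannelWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Copies the bytes into the queue; blocks only when the bounded
        // queue is full, which gives backpressure against a slow disk.
        self.tx
            .send(buf.to_vec())
            .map_err(|_| io::Error::new(io::ErrorKind::BrokenPipe, "writer thread stopped"))?;
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        // Nothing is buffered here; the background thread flushes the real sink.
        Ok(())
    }
}

/// Spawns the thread that owns the real sink and performs the blocking writes.
/// Returns the adapter plus a handle that yields the sink once all senders are dropped.
fn spawn_background_writer<W: Write + Send + 'static>(
    mut sink: W,
    queue_depth: usize,
) -> (ChannelWriter, JoinHandle<io::Result<W>>) {
    let (tx, rx) = sync_channel::<Vec<u8>>(queue_depth);
    let handle = thread::spawn(move || -> io::Result<W> {
        // The loop ends when the ChannelWriter (the only sender) is dropped.
        for chunk in rx {
            sink.write_all(&chunk)?;
        }
        sink.flush()?;
        Ok(sink)
    });
    (ChannelWriter { tx }, handle)
}
```

Under those assumptions, the `ChannelWriter` would be handed to the sync file writer in place of the `File`, and the join handle would be awaited (or joined) after the writer is closed and dropped so that write errors surface. In a tokio-based pipeline the same shape could presumably be built with a tokio mpsc channel and a task instead of a thread. Note that this costs one extra copy per write call.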

Labels: question (Further information is requested)
