Description
Which part is this question about
Library API, specifically the interaction between ArrowRowGroupWriterFactory / ArrowColumnChunk (sync, parallel encoding) and AsyncArrowWriter (async, sequential encoding).
Describe your question
We're building a high-throughput streaming k-way merge for sorted Parquet files. The write pipeline looks like:
read (rayon decode + channel prefetch) → merge sort → parallel encode (rayon) → write to disk
We want both parallel column encoding and async disk writes. Currently the API only allows picking one.
Path A: Parallel encode, sync write
let col_writers = rg_writer_factory.create_column_writers(rg_index)?;
// leaves_and_writers pairs each leaf column with one of col_writers (zip elided)
// `pool` is our shared rayon::ThreadPool; `install` is a ThreadPool method
let chunks: Vec<ArrowColumnChunk> = pool.install(|| {
    leaves_and_writers
        .into_par_iter()
        .map(|(leaf, mut col_writer)| {
            col_writer.write(&leaf)?;
            col_writer.close()
        })
        .collect()
})?;
// append_to_row_group requires sync SerializedFileWriter
let mut rg = writer.next_row_group()?;
for chunk in chunks {
    chunk.append_to_row_group(&mut rg)?;
}
rg.close()?;
Path B: Async write, sequential encode
let mut writer = AsyncArrowWriter::try_new(file, schema, Some(props))?;
writer.write(&batch).await?;
writer.close().await?;
The gap: ArrowColumnChunk (the output of parallel encoding) can only be appended via append_to_row_group, which takes a row group writer obtained from the sync SerializedFileWriter. There's no async equivalent.
Question:
Is there a way to combine parallel encoding with async writes, i.e. to append already-encoded ArrowColumnChunks through an async writer?
Additional context
Both read (decode) and write (encode) use a shared rayon pool for parallelism. The only synchronous bottleneck is the actual disk write inside append_to_row_group.
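To make the overlap we are after concrete, here is a minimal std-only sketch of the pattern. It is not parquet code: `encode_column` and `write_pipelined` are hypothetical stand-ins, scoped threads play the role of the rayon pool, and a plain `Write` sink plays the role of the file. Encoded row groups are handed to a dedicated writer thread through a bounded channel, so encoding row group N+1 can proceed while row group N is being written.

```rust
use std::io::Write;
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for encoding one column into a chunk
/// (the role played by ArrowColumnWriter::write + close in Path A).
fn encode_column(values: &[u32]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Encodes the columns of each row group in parallel and streams the
/// encoded chunks to a dedicated writer thread over a bounded channel.
/// Returns the sink once everything has been written and flushed.
fn write_pipelined<W: Write + Send>(
    row_groups: &[Vec<Vec<u32>>],
    mut sink: W,
) -> std::io::Result<W> {
    // Capacity 1: at most one encoded row group is queued behind the writer.
    let (tx, rx) = sync_channel::<Vec<Vec<u8>>>(1);
    thread::scope(|s| {
        // Writer thread: the only place that performs blocking I/O.
        let writer = s.spawn(move || -> std::io::Result<W> {
            for chunks in rx {
                for chunk in chunks {
                    sink.write_all(&chunk)?;
                }
            }
            sink.flush()?;
            Ok(sink)
        });

        for columns in row_groups {
            // Parallel encode: one scoped thread per column
            // (a rayon parallel iterator in the real pipeline).
            let chunks: Vec<Vec<u8>> = thread::scope(|enc| {
                let handles: Vec<_> = columns
                    .iter()
                    .map(|col| enc.spawn(move || encode_column(col)))
                    .collect();
                // Joining in spawn order preserves column order.
                handles.into_iter().map(|h| h.join().unwrap()).collect()
            });
            // Blocks only while the queue is full, i.e. the writer is behind.
            if tx.send(chunks).is_err() {
                break; // writer thread stopped early on an I/O error
            }
        }
        drop(tx); // close the channel so the writer loop terminates
        writer.join().unwrap()
    })
}
```

With the current API, the closest equivalent we can think of is to keep Path A as is and run the `append_to_row_group` loop on a blocking thread (for example via `tokio::task::spawn_blocking`), which works but bypasses AsyncArrowWriter entirely.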