
Conversation

@lichuang

Problem Statement

The original do_write_fragments function processed data sequentially:

  1. Read a batch of data
  2. Write it to a file
  3. Flush if file size/row limits exceeded
  4. Repeat
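The sequential flow above can be sketched roughly as follows. This is an illustrative stand-in, not the actual lance-rs API: `SequentialWriter`, its fields, and the byte/row limits are all hypothetical, and real I/O is replaced by counters.

```rust
// Hypothetical sketch of the sequential write loop described above.
struct SequentialWriter {
    bytes_written: usize,
    rows_written: usize,
    max_bytes: usize,
    max_rows: usize,
    files_flushed: usize,
}

impl SequentialWriter {
    fn write_batch(&mut self, batch_bytes: usize, batch_rows: usize) {
        // Step 2: write the batch to the current file (simulated as counters).
        self.bytes_written += batch_bytes;
        self.rows_written += batch_rows;
        // Step 3: flush if file size/row limits are exceeded.
        if self.bytes_written >= self.max_bytes || self.rows_written >= self.max_rows {
            self.flush();
        }
    }

    fn flush(&mut self) {
        // Finishing a file blocks everything else in the sequential version.
        self.files_flushed += 1;
        self.bytes_written = 0;
        self.rows_written = 0;
    }
}

fn main() {
    let mut w = SequentialWriter {
        bytes_written: 0,
        rows_written: 0,
        max_bytes: 1024,
        max_rows: 100,
        files_flushed: 0,
    };
    // Steps 1 and 4: read batches and repeat; each iteration blocks
    // until the previous write (and any flush) has completed.
    for _ in 0..5 {
        w.write_batch(512, 40);
    }
    println!("files flushed: {}", w.files_flushed); // prints "files flushed: 2"
}
```

Because every flush happens inline, total wall-clock time is the sum of all write and flush latencies, which is the bottleneck the concurrent version targets.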

This sequential approach had several limitations:

  • I/O Bottlenecks: Writing one file blocked all other operations
  • Poor Resource Utilization: CPU and network bandwidth were underutilized during I/O waits
  • High Latency: Total time was the sum of all write operations, rather than the maximum achievable under parallel execution
  • No Concurrency: Multiple files couldn't be written simultaneously even when resources were available

Solution: Concurrent Multi-File Write

The optimized version (do_write_fragments_concurrent) implements a parallel processing pipeline:

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Data Stream Input                        │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
          ┌──────────────────────┐
          │   Batch Reader       │
          └──────┬───────────────┘
                 │
          ┌──────▼───────────────┐
          │  Load Balancer       │
          │  (Round-Robin)       │
          └──────┬───────────────┘
                 │
        ┌────────┼────────┬────────┬────────┐
        ▼        ▼        ▼        ▼        ▼
   ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
   │Worker 1│ │Worker 2│ │Worker 3│ │Worker 4│
   │Writer  │ │Writer  │ │Writer  │ │Writer  │
   └────┬───┘ └────┬───┘ └────┬───┘ └────┬───┘
        │          │          │          │
        ▼          ▼          ▼          ▼
    File 1     File 2     File 3     File 4

Key Components

  1. Active Writer Pool: Maintains up to MAX_CONCURRENT_WRITERS (default: 4) active file writers
  2. Smart Dispatch System: Distributes batches across writers using round-robin with capacity awareness
  3. Asynchronous Completion: When a file reaches limits, finishing happens in parallel while new writes continue
  4. Backpressure Control: Limits how many operations can be in-flight to prevent memory exhaustion
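The dispatch-and-backpressure scheme can be sketched with plain threads and bounded channels. This is a simplified model, not the PR's implementation: the real code is async, and `dispatch`, the per-worker channel capacity, and the row-counting "write" are all assumptions made for illustration. Bounded `sync_channel`s stand in for the backpressure control in point 4.

```rust
use std::sync::mpsc;
use std::thread;

// Mirrors the MAX_CONCURRENT_WRITERS default described above.
const MAX_CONCURRENT_WRITERS: usize = 4;

// Distribute batches round-robin across a fixed pool of writer threads.
// Returns how many batches each worker "wrote".
fn dispatch(batches: Vec<u64>) -> Vec<usize> {
    let mut senders = Vec::new();
    let mut handles = Vec::new();
    for _ in 0..MAX_CONCURRENT_WRITERS {
        // Bounded channel (capacity 2): at most two in-flight batches per
        // writer, so a slow writer pushes back on the dispatcher.
        let (tx, rx) = mpsc::sync_channel::<u64>(2);
        senders.push(tx);
        handles.push(thread::spawn(move || {
            // Each worker consumes its batches; here we just count them
            // instead of performing real file I/O.
            rx.iter().count()
        }));
    }
    // Round-robin dispatch across the writer pool.
    for (i, batch) in batches.into_iter().enumerate() {
        senders[i % MAX_CONCURRENT_WRITERS].send(batch).unwrap();
    }
    drop(senders); // Close channels so workers drain and exit.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let counts = dispatch((0..10).collect());
    // 10 batches over 4 workers, round-robin:
    println!("{:?}", counts); // prints "[3, 3, 2, 2]"
}
```

In the actual async setting the same shape is typically expressed with tasks and bounded buffers rather than OS threads, but the design choice is identical: the channel capacity is what limits in-flight work and prevents memory exhaustion.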

@github-actions github-actions bot added the enhancement New feature or request label Dec 31, 2025
@lichuang lichuang force-pushed the do_write_fragments_opt branch from f4c261c to 84c36f8 Compare December 31, 2025 12:28
@wjones127 wjones127 self-assigned this Dec 31, 2025
@wjones127
Contributor

Hi @lichuang. This PR is interesting. I was thinking about doing something similar soon.

However, one drawback to doing this as written is that if you have a small stream of data, you can end up writing several small files, which is sub-optimal.

There's no way to know just from a stream how big it is. But if the input is materialized or otherwise provides information about how big the input is, then we can make smarter decisions about parallelization. That's why I wanted to implement #4583 first, before we implement a parallelized writer.
