Conversation

@kosiew (Contributor) commented Sep 2, 2025

Which issue does this PR close?

The issue asking whether `ArrowStreamExportable` loads the full data into memory or acts as a record batch reader; the reporter was hitting OOM when exporting on a smaller VM.

Rationale for this change

Exporting DataFrame results via Arrow could previously trigger eager collection of the entire result set, which risks exhausting process memory for large datasets. The project needs a zero-copy, lazy streaming path into PyArrow that:

  • Produces record batches incrementally (no full materialization),
  • Preserves partition ordering, and
  • Respects a requested schema (when trivial projections are possible) without eagerly collecting all batches.

This PR implements streaming-friendly paths in both the Rust extension and Python bindings, fixes some async/spawn patterns (improving signal handling and runtime usage), and adds tests and documentation to exercise the new behavior.

What changes are included in this PR?

High level

  • Implement __arrow_c_stream__ using a partitioned streaming reader that drains partition streams sequentially and exposes an Arrow ArrowArrayStream PyCapsule.
  • Add DataFrame iteration and async iteration support (Python): __iter__ and __aiter__ returning RecordBatch instances (see the usage sketch after this list).
  • Add RecordBatch.__arrow_c_array__ for zero-copy export of individual record batches as Arrow C Data Interface capsules.
  • Use a helper spawn_future to run DataFusion futures on the Tokio runtime while preserving Python signal handling, instead of directly creating JoinHandles and blocking on joins.
  • New tests covering iteration, streaming, schema selection/mismatch, interruption by KeyboardInterrupt, and memory behavior when streaming large datasets.
  • Add a small testing helper tests/utils.py::range_table used to construct large range tables without expanding the public API.
  • Document the streaming behavior in the user guide under a new PyArrow Streaming section.
  • Add cstr dependency and small Cargo.toml tidy / formatting changes.
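For a rough sense of the intended usage, here is a sketch (not code from this PR; the query is a placeholder, and `RecordBatchReader.from_stream` assumes PyArrow >= 14):

```python
import pyarrow as pa
from datafusion import SessionContext

ctx = SessionContext()
df = ctx.sql("SELECT * FROM some_large_table")  # placeholder query

# Lazy, zero-copy export through the Arrow C stream PyCapsule interface:
# PyArrow pulls record batches on demand instead of materializing them all.
reader = pa.RecordBatchReader.from_stream(df)
for batch in reader:
    ...  # process one pyarrow.RecordBatch at a time

# Or iterate the DataFrame directly; __iter__ yields RecordBatch objects
# produced by execute_stream().
for batch in df:
    ...
```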

Files changed (summary)

Rust

  • src/dataframe.rs

    • Introduces PartitionedDataFrameStreamReader implementing RecordBatchReader that pulls batches from partitioned SendableRecordBatchStreams and applies per-batch projection if requested.
    • Reworks __arrow_c_stream__ to use execute_stream_partitioned() and create an FFI_ArrowArrayStream from the RecordBatchReader without materializing all batches.
    • Adds a stable C capsule name constant using the cstr crate.
    • Uses spawn_future to run async tasks on the Tokio runtime.
  • src/record_batch.rs

    • Adds poll_next_batch helper and uses it to unify stream polling logic.
    • Fixes error propagation for next_stream.
  • src/utils.rs

    • Adds spawn_future utility that spawns a future on the shared Tokio runtime and waits for it while preserving Python signal behavior and converting errors appropriately.
  • src/context.rs

    • Replaces ad-hoc runtime spawn/wait with spawn_future for the execute_stream_partitioned/execution paths.

Python

  • python/datafusion/dataframe.py

    • Add __iter__, __aiter__ to iterate over RecordBatch objects produced by execute_stream().
    • Update docstrings and deprecate to_record_batch_stream (alias to execute_stream).
    • Fix imports to include RecordBatch.
  • python/datafusion/record_batch.py

    • Add __arrow_c_array__ to export a RecordBatch via Arrow C Data Interface (two capsules).
    • Clarify iterator/async iterator docstrings for RecordBatchStream.
  • python/tests/*

    • Many tests added/updated: new fixtures (fail_collect), tests for iteration (test_iter_batches, test_iter_returns_datafusion_recordbatch), streaming (`test_execute_stream_b…`), and more.

Commit messages

…ing in DataFrame

- Add `range` method to SessionContext and iterator support to DataFrame
- Introduce `spawn_stream` utility and refactor async execution for
  better signal handling
- Add tests for `KeyboardInterrupt` in `__arrow_c_stream__` and
  incremental DataFrame streaming
- Improve memory usage tracking in tests with psutil
- Update DataFrame docs with PyArrow streaming section and enhance
  `__arrow_c_stream__` documentation
- Replace Tokio runtime creation with `spawn_stream` in PySessionContext
- Bump datafusion packages to 49.0.1 and update dependencies
- Remove unused imports and restore main Cargo.toml
…andling

- Refactor record batch streaming to use `poll_next_batch` for clearer
  error handling
- Improve `spawn_future`/`spawn_stream` functions for better Python
  exception integration and code reuse
- Update `datafusion` and `datafusion-ffi` dependencies to 49.0.2
- Fix PyArrow `RecordBatchReader` import to use `_import_from_c_capsule`
  for safer memory handling
- Refactor `ArrowArrayStream` handling to use `PyCapsule` with
  destructor for improved memory management
- Refactor projection initialization in `PyDataFrame` for clarity
- Move `range` functionality into `_testing.py` helper
- Rename test column in `test_table_from_batches_stream` for accuracy
- Add tests for `RecordBatchReader` and enhance DataFrame stream
  handling
…docs

- Preserve partition order in DataFrame streaming and update related
  tests
- Add tests for record batch ordering and DataFrame batch iteration
- Improve `drop_stream` to correctly handle PyArrow ownership transfer
  and null pointers
- Replace `assert` with `debug_assert` for safer ArrowArrayStream
  validation
- Add documentation for `poll_next_batch` in PyRecordBatchStream
- Refactor tests to use `fail_collect` fixture for DataFrame collect
- Refactor `range_table` return type to `DataFrame` for clearer type
  hints
- Minor cleanup in SessionContext (remove extra blank line)
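As a rough sketch of how the psutil-based memory check mentioned in the commit notes above might look (illustrative only; the `ctx` fixture, the `range_table` signature, the row count, and the 200 MB bound are assumptions, not the PR's actual values):

```python
import psutil
import pyarrow as pa

from tests.utils import range_table  # helper added in this PR; exact signature may differ


def rss_mb() -> float:
    """Resident set size of the current process, in megabytes."""
    return psutil.Process().memory_info().rss / (1024 * 1024)


def test_streaming_is_incremental(ctx):  # `ctx` is an assumed SessionContext fixture
    df = range_table(ctx, 10_000_000)  # illustrative size
    before = rss_mb()

    # Consume the stream one batch at a time via the PyCapsule interface.
    for batch in pa.RecordBatchReader.from_stream(df):
        del batch

    # Illustrative bound: streaming should not hold the full table in memory.
    assert rss_mb() - before < 200
```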
@kosiew changed the title from "DRAFT Expose Arrow C stream and DataFrame iterator (zero‑copy streaming to PyArrow)" to "Expose Arrow C stream and DataFrame iterator (zero‑copy streaming to PyArrow)" on Sep 2, 2025
@kosiew marked this pull request as ready for review on September 2, 2025
@kylebarron (Contributor) commented:

I'm invested in this and plan to review it this afternoon!

@kylebarron (Contributor) left a comment:
I would strongly advocate for less direct integration with pyarrow, not more. Pyarrow is a massive dependency, while the Arrow PyCapsule Interface should allow for better decentralized sharing of Arrow data.

Comment on lines 171 to 177
DataFrames are also iterable, yielding :class:`pyarrow.RecordBatch` objects
lazily so you can loop over results directly:

.. code-block:: python

   for batch in df:
       ...  # process each batch as it is produced

Because the user can iterate over the stream accessed by the target library, I don't think we should define our own custom integration here, and if we do, then the yielded object should not be a pyarrow RecordBatch, but rather an opaque, minimal Python class that just exposes __arrow_c_array__ so that the user can choose what Arrow library they want to use to work with the batch.
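For illustration (not from this PR), with such a minimal wrapper the consumer chooses the Arrow implementation; e.g. recent PyArrow (>= 14) can import any object exposing __arrow_c_array__:

```python
import pyarrow as pa

# `df` is a datafusion DataFrame; each yielded batch only needs __arrow_c_array__.
for batch in df:
    pa_batch = pa.record_batch(batch)  # imported via the Arrow PyCapsule protocol
    ...  # or hand `batch` to polars, arro3, nanoarrow, etc.
```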

@kylebarron (Contributor) commented Sep 3, 2025:

We already have our own RecordBatch class: https://datafusion.apache.org/python/autoapi/datafusion/record_batch/index.html#datafusion.record_batch.RecordBatch

Also, we should ensure that the dunder methods are rendered in the docs. It doesn't look like they are currently. (Or maybe the dunder methods on that RecordBatch aren't documented?)

return self.df.__arrow_c_stream__(requested_schema)

def __iter__(self) -> Iterator[pa.RecordBatch]:

I don't really think there's a good rationale for having this method, especially as it reuses the exact same mechanism as the PyCapsule Interface. If anything, we might want to have an __aiter__ method that has a custom async connection to the DataFusion context.


RecordBatchStream already has __iter__ and __aiter__ methods: https://datafusion.apache.org/python/autoapi/datafusion/record_batch/index.html#datafusion.record_batch.RecordBatchStream

Can we just have a method that converts a DataFrame into a RecordBatchStream? Then an __iter__ on DataFrame would just convert to a RecordBatchStream under the hood.
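A sketch of that suggestion (illustrative only; it reuses the existing execute_stream() API, which already returns a RecordBatchStream):

```python
from collections.abc import AsyncIterator, Iterator

from datafusion.record_batch import RecordBatch


class DataFrame:
    ...

    def __iter__(self) -> Iterator[RecordBatch]:
        # Delegate to the existing RecordBatchStream, which implements __iter__.
        return iter(self.execute_stream())

    def __aiter__(self) -> AsyncIterator[RecordBatch]:
        # Call __aiter__ directly rather than the aiter() builtin to stay
        # compatible with Python 3.9.
        return self.execute_stream().__aiter__()
```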

src/dataframe.rs Outdated
Comment on lines 62 to 64
#[allow(clippy::manual_c_str_literals)]
static ARROW_STREAM_NAME: &CStr =
unsafe { CStr::from_bytes_with_nul_unchecked(b"arrow_array_stream\0") };

As suggested by the linter, can we just use c"arrow_array_stream"?

src/dataframe.rs Outdated
Comment on lines 66 to 89
unsafe extern "C" fn drop_stream(capsule: *mut ffi::PyObject) {
    if capsule.is_null() {
        return;
    }

    // When PyArrow imports this capsule it steals the raw stream pointer and
    // sets the capsule's internal pointer to NULL. In that case
    // `PyCapsule_IsValid` returns 0 and this destructor must not drop the
    // stream as ownership has been transferred to PyArrow. If the capsule was
    // never imported, the pointer remains valid and we are responsible for
    // freeing the stream here.
    if ffi::PyCapsule_IsValid(capsule, ARROW_STREAM_NAME.as_ptr()) == 1 {
        let stream_ptr = ffi::PyCapsule_GetPointer(capsule, ARROW_STREAM_NAME.as_ptr())
            as *mut FFI_ArrowArrayStream;
        if !stream_ptr.is_null() {
            drop(Box::from_raw(stream_ptr));
        }
    }

    // `PyCapsule_GetPointer` sets a Python error on failure. Clear it only
    // after the stream has been released (or determined to be owned
    // elsewhere).
    ffi::PyErr_Clear();
}

We shouldn't need to do any of this, according to upstream discussion apache/arrow-rs#5070 (comment)

        self.schema.clone()
    }
}

#[pymethods]
impl PyDataFrame {

Essentially this changes the DataFrame construct to always be "lazy"? Previously a DataFrame was always materialized in memory, whereas now it's just a representation of future batches?

src/dataframe.rs Outdated
let ffi_stream = FFI_ArrowArrayStream::new(reader);
let stream_capsule_name = CString::new("arrow_array_stream").unwrap();
PyCapsule::new(py, ffi_stream, Some(stream_capsule_name)).map_err(PyDataFusionError::from)
let stream = Box::new(FFI_ArrowArrayStream::new(reader));

If you have an FFI_ArrowArrayStream you should be able to just pass that to PyCapsule::new without touching any unsafe: https://github.com/kylebarron/arro3/blob/cb2453bf022d0d8704e56e81a324ab5a772e0247/pyo3-arrow/src/ffi/to_python/utils.rs#L93-L94

@kylebarron (Contributor) commented:

In #1227 I explicitly suggested removing the pyarrow dependency altogether. I thought I had created an issue before, but apparently not.

@kylebarron (Contributor) commented:

This would also close #1011

kosiew and others added 22 commits September 13, 2025 16:37
…braries, clarifying the protocol and adding implementation-agnostic notes.
…tion about eager conversion, emphasizing on-demand batch processing to prevent memory exhaustion.
… RecordBatchReader in test_arrow_c_stream_schema_selection
@kosiew requested a review from kylebarron on September 13, 2025
@timsaucer (Contributor) left a comment:

This seems like a massive improvement! I took a skim through, and at a high level it looks like the correct way to approach it. I will try to take more time this week to review more carefully and to test a few things out myself.

Comment on lines 1157 to 1158
directly to remain compatible with Python < 3.10 (this project
supports Python >= 3.6).

I think our minimum supported version is 3.9 right now

@kosiew (Author) replied:

I will amend the comment

Successfully merging this pull request may close these issues.

export to arrow generate OOM