Conversation

ntjohnson1
Member

@ntjohnson1 ntjohnson1 commented Jul 18, 2025

Related

Closes #10695

What

This adds the datafusion version we specify in our examples as a lower bound, to make sure datafusion gets installed with rerun before we use it. I added a simple smoke test to reproduce the issue, then verified that it passes with the install specification.

Right now we have some optional dependencies specified with our package. However, we don't handle them very carefully, so importing different parts of the package could blow up. Taking inspiration from pandas, I deferred imports so we should be able to import anything and not hit errors unless we ACTUALLY use the thing with the dependency. It also makes it clearer how to resolve things, since it wasn't obvious to me that our notebook was already incorporated as an optional dependency: `python -c "import rerun.notebook"` just errors.

Adds rerun[datafusion] for the datafusion dependencies and rerun[all] to get all of the non-testing features.

If we like this approach we should be able to update the couple of different internal places where we duplicate notebook checking to use this global check approach.
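A minimal sketch of the deferred-import idea described above, assuming a pandas-style helper. The names `import_optional`, `OptionalDependencyError`, and `make_filter` are hypothetical and are not taken from the rerun codebase (the PR itself refers to `RerunOptionalDependencyError` in `rerun.error_utils`); only `datafusion.col` is a real import, as shown in the diff under review.

```python
import importlib
from types import ModuleType


class OptionalDependencyError(ImportError):
    """Hypothetical error type: a feature needs a package that is not installed."""


def import_optional(module_name: str, extra: str) -> ModuleType:
    """Import `module_name` on demand, or explain which extra provides it."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as err:
        raise OptionalDependencyError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install 'rerun-sdk[{extra}]'"
        ) from err


def make_filter(column: str):
    # The import runs only when the feature is actually used, so importing
    # the enclosing module never fails because of a missing optional package.
    datafusion = import_optional("datafusion", extra="datafusion")
    return datafusion.col(column)
```

With this shape, importing the module always succeeds, and a missing package only surfaces as an error (with an install hint) when `make_filter` is called.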

@ntjohnson1 ntjohnson1 added the exclude from changelog PRs with this won't show up in CHANGELOG.md label Jul 18, 2025

github-actions bot commented Jul 18, 2025

Latest documentation preview deployed successfully.

Result Commit Link
3a34bec https://landing-5rgn0tp4f-rerun.vercel.app/docs

Note: This comment is updated whenever you push a commit.


github-actions bot commented Jul 18, 2025

Web viewer built successfully. If applicable, you should also test it:

  • I have tested the web viewer
Result Commit Link Manifest
3a34bec https://rerun.io/viewer/pr/10696 +nightly +main

Note: This comment is updated whenever you push a commit.

"pillow>=8.0.0", # Used for JPEG encoding. 8.0.0 added the `format` arguments to `Image.open`
"pyarrow>=18.0.0",
"typing_extensions>=4.5", # Used for PEP-702 deprecated decorator
"datafusion>=45.0.0",
Member

For now, the best practice is to keep the datafusion Python version the same as the datafusion Rust version we use for its FFI crate. They are currently cross-version compatible, but there have been some breaking changes in the most recent release.

That does mean we'd want users to be on 47.0.0.

Also, more generally, I think we don't want this as a dependency in rerun, because it means any of our open source viewer users who are not using datafusion would now have to bring in this fairly large package. This is why it hasn't been added as a dependency before.

I know some packages support something like `pip install rerun-sdk[datafusion]` to pull in additional dependencies, but I am not sure how that is done.

Member Author

Gotcha, I'll take a look at adding it as optional and hiding the component without that. I'm sure it's just a maturin flag. We apparently have a similar issue in our notebook, so I'm doing some additional, larger cleanup for the things this kicked up.

@ntjohnson1 ntjohnson1 changed the title Require Datafusion Since We Import It Make Optional Dependencies Clearer Jul 18, 2025
Member

@abey79 abey79 left a comment

lgtm, thanks!

Comment on lines +7 to +14
from rerun.error_utils import RerunOptionalDependencyError

HAS_DATAFUSION = True
try:
    from datafusion import Expr, ScalarUDF, col, udf
except ModuleNotFoundError:
    HAS_DATAFUSION = False

Member

Is the type checker able to detect if a new method accesses, e.g., Expr without first testing HAS_DATAFUSION?

Member Author

Good question. Our type checking is SO BROKEN right now #10704. Maybe I merge this and then pull main into that branch to check it out.
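Independent of what a static checker reports, one possible fallback is a runtime guard, so that any method relying on the optional package fails with a clear message when the flag is false. The following is a sketch only: `requires_optional`, `build_expr`, and the error text are hypothetical and not part of the rerun codebase; the flag is computed here with `importlib.util.find_spec` rather than the `try`/`except` import from the diff.

```python
import functools
import importlib.util
from typing import Any, Callable, TypeVar

F = TypeVar("F", bound=Callable[..., Any])

# True only if the package can be found; nothing is imported at this point.
HAS_DATAFUSION = importlib.util.find_spec("datafusion") is not None


def requires_optional(available: bool, extra: str) -> Callable[[F], F]:
    """Hypothetical decorator: refuse to run `func` unless `available` is true."""

    def decorator(func: F) -> F:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if not available:
                raise ImportError(
                    f"{func.__name__} needs the '{extra}' extra: "
                    f"pip install 'rerun-sdk[{extra}]'"
                )
            return func(*args, **kwargs)

        return wrapper  # type: ignore[return-value]

    return decorator


@requires_optional(HAS_DATAFUSION, extra="datafusion")
def build_expr(column: str) -> Any:
    # Import inside the function so the module itself always imports cleanly.
    from datafusion import col

    return col(column)
```

The guard does not replace static checking; it only ensures that a forgotten `HAS_DATAFUSION` test turns into an actionable error instead of a `NameError`.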

@ntjohnson1 ntjohnson1 added this to the 0.24.1 (maybe) milestone Jul 24, 2025
@ntjohnson1 ntjohnson1 merged commit c5e2c66 into main Jul 24, 2025
47 checks passed
@ntjohnson1 ntjohnson1 deleted the nick/datafusion_import branch July 24, 2025 12:14
@emilk emilk changed the title Make Optional Dependencies Clearer Carify optional rerun-sdk dependencies Aug 6, 2025
@emilk emilk changed the title Carify optional rerun-sdk dependencies Clarify optional rerun-sdk dependencies Aug 6, 2025
@emilk emilk added the sdk-python Python logging API label Aug 6, 2025
@emilk emilk changed the title Clarify optional rerun-sdk dependencies Add rerun-sdk[datafusion] and rerun-sdk[all] Aug 6, 2025
@emilk emilk added include in changelog and removed exclude from changelog PRs with this won't show up in CHANGELOG.md labels Aug 6, 2025
@emilk emilk mentioned this pull request Aug 6, 2025
9 tasks
ntjohnson1 added a commit that referenced this pull request Aug 6, 2025
Closes #10695

~This adds the datafusion version we specify in our examples as a lower
bound to make sure datafusion gets installed with rerun before we use it.
Added a simple smoke test to repro the issue then verified it passed
with the install specification.~

Right now we have some optional dependencies specified with our package.
However we don't handle them very carefully. So importing different
parts of the package could blow up. Taking inspiration from pandas I
deferred imports so we should be able to import anything and not hit
errors unless we ACTUALLY use the thing with the dependency. It also
makes it clearer how to resolve things since it wasn't obvious to me
that our notebook was already incorporated as an optional dependency.
`python -c "import rerun.notebook"` just errors.

Adds `rerun[datafusion]` for the datafusion dependencies and
`rerun[all]` to get all of the non-testing features.

If we like this approach we should be able to update the couple of
different [internal
places](https://github.com/rerun-io/rerun/blob/7484e03f9a98341114c30abad49895258288df76/rerun_py/rerun_sdk/rerun/recording_stream.py#L900)
where we duplicate notebook checking to use this global check approach.
Successfully merging this pull request may close these issues.

Make Datafusion dependency explicit
4 participants