feat: Support Narwhals #6567
Codecov Report
@@            Coverage Diff             @@
##             main    #6567      +/-   ##
==========================================
+ Coverage   88.97%   89.06%   +0.09%
==========================================
  Files         328      331       +3
  Lines       70320    71147     +827
==========================================
+ Hits        62570    63370     +800
- Misses       7750     7777      +27
@@ -645,7 +645,7 @@ def select(self, selection_expr=None, selection_specs=None, **selection):
             return self

         # Handle selection dim expression
-        if selection_expr is not None:
+        if selection_expr is not None and selection_expr.ops:
This would also make something like hv.Dataset(df_pandas).select(selection_expr=hv.dim('a')) fail.
 datatypes = ['dataframe', 'dictionary', 'grid', 'xarray', 'multitabular',
-             'spatialpandas', 'dask_spatialpandas', 'dask', 'cuDF', 'array',
+             'spatialpandas', 'dask_spatialpandas', 'dask', 'cuDF', 'array', 'narwhals',
              'ibis']
Open to changing the position here.
The position seems fine, though the question is at what point we deprecate the dask and cuDF interfaces (if ever).
I think this PR is in a good state. There are likely some rough edges, but I don't think that should stop a review/merge.
    if isinstance(df, (nw.DataFrame, nw.LazyFrame)):
        df = df.select(list(map(str, kdims + vdims)))
        if isinstance(df, nw.LazyFrame):
            df = df.collect()
Which LazyFrame types does narwhals support these days? Can we check if it's backed by a dask dataframe and avoid collect for that case?
yup you can do if df.implementation.is_dask():
Have done this in 2c6388e
        return False


class NarwhalsDtype:
Would implement type as well if possible.
Implemented like this:
    @property
    def type(self):
        return type(self.dtype)
            col = nw.col(name)
        else:
            col = nw.col(name).drop_nulls()
        # NOTE: Some narwhals backends (duckdb) will return nan as
Hmm, there's no nanmin/nanmax?
Polars ignores NaN values by default, so Polars' nan_min actually propagates NaN instead of ignoring it.
In general, NaN values are rare outside of pandas, as other libraries have proper null value support (and uniformly use null to indicate missing data); NaN only results from undefined mathematical operations like 0/0 or log(-1). My general advice is to just deal with null values (as you're already doing here with drop_nulls), let each backend deal with its own definition of null, and treat NaN as a user error. Alternatively, you can call fill_nan, but note that it's only supported on float columns.
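The distinction can be illustrated in plain Python. This is only a stand-in under stated assumptions: None plays the role of null, the list comprehensions play the role of drop_nulls, and none of it is narwhals or Polars API.

```python
import math

values = [math.nan, 3.0, None, 1.0]

# Drop missing values first (the analogue of drop_nulls).
non_null = [v for v in values if v is not None]

# NaN is a float, not a missing value, so it survives the drop ...
has_nan = any(math.isnan(v) for v in non_null)

# ... and it makes min/max order-dependent, because every comparison
# against NaN is False. Here NaN comes first, so min() returns NaN.
naive_min = min(non_null)

# Treating NaN as missing too (roughly what filling NaN with null and then
# dropping nulls would achieve) gives a well-defined range.
finite = [v for v in non_null if not math.isnan(v)]
value_range = (min(finite), max(finite))
```

Whether NaN should be silently dropped like this or surfaced as a user error is exactly the design choice under discussion in this thread.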
There was some previous discussion of this in #6567 (comment) and #6567 (comment).
holoviews/core/data/narwhals.py
Outdated
        if isinstance(selection_mask, np.ndarray):
            # Boolean ndarray does not work, so we convert it to list
            # If the dtype is not boolean, we let narwhals error in filter
            selection_mask = selection_mask.tolist()
Hmm, that sucks.
hey - I just checked and Polars allows this:
In [4]: df = pl.DataFrame({'a': [3,3,2,1], 'b': [3,2,2,1], 'c': [1,2,3,4]})
In [5]: df.filter((df['a']>1).to_numpy())
Out[5]:
shape: (3, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 3 ┆ 3 ┆ 1 │
│ 3 ┆ 2 ┆ 2 │
│ 2 ┆ 2 ┆ 3 │
└─────┴─────┴─────┘
whereas Narwhals errors
I think we should allow for this in Narwhals (at least, for the eager case). Would it be OK for you to have this supported just for the eager case?
Yes, I don't think it can be supported properly for the lazy case, e.g. for our dask interface we don't support it either.
Hey, sorry to chime in. Without waiting for a new narwhals release or having to pin narwhals to the latest one, it is still possible to avoid casting to a list by converting the numpy array to a narwhals series backed by the same backend as the frame:
import narwhals as nw
import numpy as np
import pyarrow as pa
frame = nw.from_native(pa.table({"a": [1,2,3]}), eager_only=True)
mask_np = np.array([True, False, True])
mask_nw = nw.new_series(
name="mask",
values=mask_np,
backend=frame.implementation,
)
# or in more recent versions, even better
# mask_nw = nw.Series.from_numpy("mask", mask_np, backend=frame.implementation)
frame.filter(mask_nw)
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
| pyarrow.Table |
| a: int64 |
| ---- |
| a: [[1,3]] |
└──────────────────┘
That seems more elegant, thanks @FBruzzesi.
And you are always welcome to review the code 👍
            return data.collect()[dim.name]
        else:
            return data  # Cannot slice LazyFrame
    return data[dim.name]
What's the return type here? E.g. for pandas this would be a pd.Series if keep_index else np.ndarray.
It will return a nw.Series, except if compute is False, in which case a LazyFrame input will return a single-column LazyFrame.

    """
    if issubclass(dataset.interface, NarwhalsInterface):
        return dataset.data
Definitely a considerable broadening of our definition of DataFrame.
Do you want to force it to pandas? Or should we try this out for now and enable it if we see too many problems with it?
What is the plan for documenting this?
I don't think there should be many documentation updates for this, other than the updates I pushed in 4c80fdb. We should definitely highlight that this is now supported in the release notes and other announcements.
Still very much a draft... A lot of the logic is currently copied/pasted from the PandasInterface.