RFC adopt narwhals for dataframe support #31049
Comments
I'm personally happy with depending on narwhals for all dataframe work. I'm also okay with a hard dependency, since it's a very lightweight library with no transitive dependencies. I wouldn't say no to a soft-dependency implementation, but I think it's nicer for users if it's always installed with sklearn.
Is there a way to positionally select rows with narwhals? (This is important for …) Concretely, how do I get the equivalent of the following?
import pandas as pd
x = pd.DataFrame({"a": [4, 2, 3, 4, 5, 6], "b": [3, 5, 14, 2, 421, 12]})
x.loc[[1, 3, 5], :]
Thanks for the ping! Whether as a hard dependency, soft dependency, or vendored, I'd love to see this 😍
@thomasjpfan yup, numpy-style indexing is supported on DataFrame:
In [8]: df = nw.from_native(x, eager_only=True)
In [9]: df
Out[9]:
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
| a b |
| 0 4 3 |
| 1 2 5 |
| 2 3 14 |
| 3 4 2 |
| 4 5 421 |
| 5 6 12 |
└──────────────────┘
In [10]: df[[1,3,5]]
Out[10]:
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
| a b |
| 1 2 5 |
| 3 4 2 |
| 5 6 12 |
└──────────────────┘
See DataFrame.__getitem__. PyShiny makes heavy use of it. You may also be interested in "what about the pandas index?".
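To make the answer above concrete, here is a minimal sketch of positional row selection that should work for any eager dataframe narwhals supports. The helper name `take_rows` is hypothetical and not part of narwhals or scikit-learn; it only relies on `nw.from_native(..., eager_only=True)` and list indexing as shown in the session above, plus `to_native()` to hand back the caller's own dataframe type.

```python
def take_rows(native_df, positions):
    """Return the rows at the given integer positions, in the input's own type.

    Hypothetical helper for illustration only.
    """
    # Imported inside the function so that narwhals is only needed when
    # this helper is actually called (soft-dependency style).
    import narwhals as nw

    # eager_only=True rejects lazy frames, for which positional
    # selection is not meaningful.
    df = nw.from_native(native_df, eager_only=True)

    # Indexing with a list of integers selects rows by position.
    selected = df[list(positions)]

    # Unwrap back to the original library's dataframe (pandas, polars, ...).
    return selected.to_native()
```

Called as `take_rows(x, [1, 3, 5])` with the pandas frame from the question, this should return a pandas DataFrame holding the second, fourth and sixth rows.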
So far, no critical voice has been raised, but there is still time to raise one. Please do so. Let's see how contributors would prefer to add narwhals:
Please cast a vote by selecting one of the 2 emojis as a reaction to this comment.
OK, nobody was bold enough to raise concerns yet, so let me be the first one to do it 😅. Not being very familiar with the space, I would need more information to first form an opinion and eventually try to reach a decision on this.
I have to admit my default reaction is conservatism. Adding a new dependency to scikit-learn has always been done with careful consideration of the trade-offs involved.
One of my worries: from afar, this ecosystem (to simplify, let's say narwhals + polars) moves very fast (which, you know, can also be a good thing 😉), and there seems to be an impedance mismatch with how fast the core Scientific Python ecosystem is moving.
What does it bring us?
This has probably been discussed in other issues, but it would be nice to have it summarized in this issue, where we are trying to make a decision.
How will our code change? aka what does it cost us?
I guess linking to pull requests from other projects that switched to narwhals would be useful as well.
Discussion of the risks involved
Here are a few things off the top of my head; it would be nice to have these aspects detailed as well in the cost-benefit analysis:
The way that I had understood this proposal was: optional dependency not needed for pandas support, but needed for other dataframe support. I'm not against that proposal :).
But I was thinking: there will be an element of surprise, of "discomfort", which is that when users pass in a polars dataframe, they will get an error and will need to install narwhals to fix this error.
So my question is: what does narwhals gain us for polars support? We already have decent support for polars.
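The "element of surprise" described above can be sketched as follows, assuming the soft-dependency reading of the proposal: pandas input is handled without narwhals, while any other dataframe needs narwhals installed. The function name `choose_dataframe_path` and the error message are hypothetical illustrations, not scikit-learn code.

```python
import importlib.util


def choose_dataframe_path(X):
    """Decide which (hypothetical) code path would handle dataframe X."""
    # Top-level package that defines X's type, e.g. "pandas" or "polars".
    root = type(X).__module__.split(".")[0]

    if root == "pandas":
        # Assumed: pandas keeps working without narwhals.
        return "pandas"

    # Any other dataframe library goes through narwhals, so its absence
    # surfaces as an error the user has to fix by installing it.
    if importlib.util.find_spec("narwhals") is None:
        raise ImportError(
            f"Support for {root} dataframes requires the optional "
            "dependency narwhals; install it with `pip install narwhals`."
        )
    return "narwhals"
```

With a hard dependency the `ImportError` branch disappears, which is the user-facing difference the comment above is pointing at.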
fairlearn started working on introducing narwhals one month ago. Here's the issue if somebody wants to have a look: fairlearn/fairlearn#1522
This was preceded by a discovery process led by the narwhals people, who are very helpful and quick, and also very interested in spreading narwhals as widely as possible. I am sure we could ask them for a similar exploration for scikit-learn before making a decision.
The big difference from scikit-learn is that fairlearn uses pandas extensively within its classes, whereas scikit-learn converts to numpy (or another array library) at the earliest moment.
I agree with the more concerned voices who want to explore the extent first and understand the gains: there is already a large interruption between input and output (in terms of using types from dataframe libraries), so where would narwhals be beneficial? For converting the outputs into the same type as the inputs, if the user uses … narwhals also supports more than …
The main benefit I see is having one code path that supports both pandas and polars, and opening the door to toggling on other dataframe libraries. It would be great to use their dataframe-first expression API, but that would take us too far from our current way of doing things. Status quo is:
If we add a dataframe-first expression API for compute, it'll add a third code path, which I do not really like:
(Although 4. means we can have lazy dataframe support.)
That being said, I think there is a net win in vendoring narwhals to more easily support polars & pandas at the same time.
For me, I would:
From a user's point of view, the narwhals code change would be invisible, and they would get the same feature set.
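The "one code path" idea can be sketched on the input side, where scikit-learn converts dataframes to arrays early. The snippet below is only an illustration under that assumption: `frame_to_numpy` is a hypothetical name, and it uses nothing beyond narwhals' `from_native`, `columns` and `to_numpy`.

```python
def frame_to_numpy(X):
    """Return (values, column_names) for a pandas or polars dataframe.

    Hypothetical helper: the same lines run regardless of which
    dataframe library produced X.
    """
    # Lazy import keeps narwhals optional until a dataframe is passed.
    import narwhals as nw

    df = nw.from_native(X, eager_only=True)

    # Column names are exposed uniformly across backends.
    names = list(df.columns)

    # Convert to a NumPy array at the earliest moment, then leave
    # dataframe-land entirely.
    values = df.to_numpy()
    return values, names
```

Nothing in the body branches on the backend, so from the caller's side the feature set stays the same for pandas and polars input.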
Thanks all for the discussion! Just wanted to comment on
I'd suggest either using … The idea behind …
For people who couldn't attend yesterday's call (04/29/25): Narwhals runs the test suites of all downstream libraries. This ensures that the projects which have adopted it don't break, so scikit-learn's tests would be run as well.
At least as of SLEP018, scikit-learn supports dataframes passed as X. Further discussion is currently taking place in #25896.
This issue is to discuss whether, and in which form, a future scikit-learn should depend on narwhals for general dataframe support.
+ wide df support
+ less maintenance within scikit-learn
- external dependency
@scikit-learn/core-devs @MarcoGorelli