RFC adopt narwhals for dataframe support #31049

Open
lorentzenchr opened this issue Mar 21, 2025 · 10 comments · May be fixed by #31127

@lorentzenchr
Member

lorentzenchr commented Mar 21, 2025

At least as of SLEP018, scikit-learn supports dataframes passed as X. Related discussion is also ongoing in #25896.

This issue is to discuss whether or not, or in which form, a future scikit-learn should depend on narwhals for general dataframe support.

+ wide df support
+ less maintenance within scikit-learn
- external dependency

@scikit-learn/core-devs @MarcoGorelli

@adrinjalali
Member

I'm personally happy with depending on narwhals for all dataframe work.

I'm also okay with a hard dependency since it's a very lightweight library with no transitive dependencies. That said, I wouldn't say no to a soft-dependency implementation; I just think it's nicer for users if it's always installed with sklearn.

@thomasjpfan
Member

thomasjpfan commented Mar 24, 2025

Is there a way to positionally select the rows with narwhals? (This is important for _safe_indexing)

Concretely, how do I get the [1, 3, 5] positional rows in a narwhals compliant dataframe? In pandas, it would be this:

import pandas as pd

x = pd.DataFrame({"a": [4, 2, 3, 4, 5, 6], "b": [3, 5, 14, 2, 421, 12]})

x.iloc[[1, 3, 5], :]  # positional selection; equals .loc here only because of the default RangeIndex

@MarcoGorelli
Contributor

Thanks for the ping!

Whether as a hard-dependency, soft-dependency, or vendored, I'd love to see this 😍

@thomasjpfan yup, numpy-style indexing is supported on narwhals.DataFrame objects (but not on narwhals.LazyFrame, where row order is undefined):

In [8]: df = nw.from_native(x, eager_only=True)

In [9]: df
Out[9]:
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
|       a    b     |
|    0  4    3     |
|    1  2    5     |
|    2  3   14     |
|    3  4    2     |
|    4  5  421     |
|    5  6   12     |
└──────────────────┘

In [10]: df[[1,3,5]]
Out[10]:
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
|        a   b     |
|     1  2   5     |
|     3  4   2     |
|     5  6  12     |
└──────────────────┘

See DataFrame.__getitem__. PyShiny makes heavy use of it.

You may also be interested in "What about the pandas index?"

@lorentzenchr
Member Author

So currently, there is no critical voice, but still time to raise one. Please do so.

Let's see how contributors would prefer to add narwhals:

Please cast a vote by selecting one of the 2 emojis as a reaction to this comment.
Disclaimer: While still kind of informal, this will shed light on how to proceed.

lorentzenchr linked a pull request on Apr 1, 2025 that will close this issue
@lesteve
Member

lesteve commented Apr 2, 2025

OK, nobody was bold enough to raise concerns yet, so let me be the first one to do it 😅.

Being not very familiar with the space, I would need more information to first form an opinion and eventually try to reach a decision on this.

I have to admit my default reaction is conservatism. Adding a new dependency in scikit-learn has always been done with careful consideration about the trade-offs involved.

One of my worries: I have to admit that from afar this ecosystem (to simplify, let's say narwhals + polars) moves very fast (which, you know, can also be a good thing 😉) and there seems to be an impedance mismatch with how fast the core Scientific Python ecosystem is moving.

What does it bring us?

This has probably been discussed in other issues but it would be nice to have it summarized in this issue where we are trying to make a decision.

  • what does depending on narwhals bring us? Is it mostly to do pandas + polars conveniently? Are there any issues with our pandas + polars implementation that depending on narwhals would fix?
  • are there many users who care about support for a dataframe library other than pandas or polars?
  • What kind of workflows would depending on narwhals unlock for you? What kind of work-around do you currently use?

How will our code change? aka what does it cost us?

  • how much of our code will need to change? Is it mostly a few (5-10) key functions?
  • what is your plan for the future? Would the switch to narwhals be done in a single PR? Is it going to be a lengthy per-function/estimator process like array API support (hopefully not 😅)?
  • not sure how much work this is, but it would help a lot if your WIP PR ENH add narwhals as dependency #31127 changed our code in a way that enables you to do something that wasn't possible before. This would give us a feeling for how much change is needed from us to switch to narwhals.

I guess linking to pull requests from other projects that switched to narwhals would be useful as well.

Discussion of the risks involved

Here are a few things off the top of my head, it would be nice to have these aspects detailed as well in the cost-benefit analysis:

  • narwhals stops being maintained and we end up having to maintain a fork. I guess unlikely in the short-term but you know it's hard to make predictions, especially about the future 🔮.
  • intersection with array API? narwhals can support Dask for example, and with array-api-extra array API also supports Dask, how does it interact together in our code base, any caveats there?
  • any consideration about the lazy aspects, any insights about supporting it, is it out of scope for now?

@GaelVaroquaux
Member

GaelVaroquaux commented Apr 2, 2025 via email

@StefanieSenger
Contributor

StefanieSenger commented Apr 2, 2025

fairlearn started working on introducing narwhals one month ago. Here's the issue if somebody wants to have a look: fairlearn/fairlearn#1522

This was preceded by a discovery process led by the narwhals people, who are very helpful and quick, and also very interested in spreading narwhals as widely as possible. I am sure we could ask them for a similar exploration for scikit-learn before making a decision.

The large difference to scikit-learn is that fairlearn uses pandas extensively within the classes, whereas scikit-learn converts to numpy (or another array library) at the earliest moment.

I agree with the more concerned voices who want to explore the extent first and understand the gains:

There is a large interruption between input and output already (in terms of using types from dataframe libraries), so where would narwhals then be beneficial? For converting the outputs into the same type as the inputs, if the user uses set_output()? Is that gain large enough compared to individual pandas/polars/ etc. conversions?

narwhals also supports more than pandas and polars, for instance pyarrow, which has some efficiency advantages, but not directly within scikit-learn (since array libraries are used). Is there any harm in converting the output afterwards, if users want to use it? I cannot see any. Users know what they want and don't need to write complicated conditions covering every case when they use an estimator's predictions or a transformer's output.

@thomasjpfan
Member

thomasjpfan commented Apr 2, 2025

what does, for polars support, narwhals gain us? We already have a decent support of polars.

The main benefit I see is to have one code path that supports both pandas and polars. And open the door to toggle on other dataframe libraries.


It would be great to use their dataframe-first expression API, but that brings us too far from our current way of doing things. Status quo is:

  1. ndarray -> compute on ndarray
  2. array-api array -> compute on array-api array
  3. dataframe -> convert to ndarray -> compute on ndarray

If we add a dataframe-first expression API for compute, it'll add a fourth code path, which I do not really like:

  4. dataframe -> compute using dataframe expression API

(Although 4. means we can have lazy dataframe support)
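
To make path 3 of the status quo concrete, here is a rough sketch using only NumPy and pandas. The function `center_columns` is a hypothetical example and not part of scikit-learn; it only illustrates the convert-early, compute-on-ndarray pattern.

```python
import numpy as np
import pandas as pd


def center_columns(X):
    """Hypothetical transform following path 3: dataframe -> ndarray -> compute."""
    is_frame = isinstance(X, pd.DataFrame)
    # Remember the labels so they can be restored on the way out.
    columns = X.columns if is_frame else None
    index = X.index if is_frame else None

    # Convert to ndarray at the earliest moment.
    arr = np.asarray(X, dtype=float)

    # All computation happens on the ndarray.
    centered = arr - arr.mean(axis=0)

    # Optionally wrap the result back into the input's container type.
    if is_frame:
        return pd.DataFrame(centered, columns=columns, index=index)
    return centered


df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
out = center_columns(df)
print(out)
```

In this pattern the dataframe library is only touched at the boundaries (input conversion and output wrapping), which is where a dataframe-agnostic layer would slot in.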


That being said, I think there is a net win for vendoring narwhals to more easily support polars & pandas at the same time.

what is your plan for the future? Would the switch to narwhals be done in a single PR? Is it going to be a lengthy per-function/estimator process like array API support (hopefully not 😅)?

For me, I would:

  1. Vendor narwhals and use v1 of the narwhals spec. (or v2 when they release it)
  2. I suspect most of the changes are in https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_set_output.py and
    def _safe_indexing(X, indices, *, axis=0):
  3. There are a few places where we use the dataframe interchange, which most likely need to update to narwhals.

From a user's point of view, the narwhals code change would be invisible and they would get the same feature set.

@MarcoGorelli
Contributor

Thanks all for the discussion! Just wanted to comment on

Vendor narwhals and use v1 of the narwhals spec. (or v2 when they release it)

I'd suggest either using stable.v1 or vendoring, doing both is probably unnecessary.

The idea behind stable.v1 is: "code you write today will keep working in all future versions of Narwhals". It guards against deprecations and breaking changes. If you're vendoring, then there's no risk of deprecations / breaking changes anyway 😄

@DeaMariaLeon
Contributor

For people who couldn't attend yesterday's call (04/29/25):

Narwhals runs all the downstream libraries' tests. This ensures that the projects which have adopted it don't break. So, scikit-learn's tests would be run as well.

8 participants