RFC adopt narwhals for dataframe support #31049

lorentzenchr · 2025-03-21T13:15:28Z

At least as of SLEP018, scikit-learn supports dataframes passed as X. Further discussion is taking place in #25896.

This issue is to discuss whether or not, or in which form, a future scikit-learn should depend on narwhals for general dataframe support.

+ wide df support
+ less maintenance within scikit-learn
- external dependency

@scikit-learn/core-devs @MarcoGorelli

adrinjalali · 2025-03-21T17:02:48Z

I'm personally happy with depending on narwhals for all dataframe work.

I'm also okay with a hard dependency since it's a very lightweight lib with no transitive dependencies. I wouldn't say no to a soft-dependency implementation, though; I just think it's nicer for users if it's always installed with sklearn.

thomasjpfan · 2025-03-24T16:30:56Z

Is there a way to positionally select the rows with narwhals? (This is important for _safe_indexing)

Concretely, how do I get the [1, 3, 5] positional rows in a narwhals compliant dataframe? In pandas, it would be this:

import pandas as pd

x = pd.DataFrame({"a": [4, 2, 3, 4, 5, 6], "b": [3, 5, 14, 2, 421, 12]})

x.iloc[[1, 3, 5], :]
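
To make the distinction between positional and label-based selection concrete, here is a small sketch. The reordered index is an assumption added purely for illustration; with the default RangeIndex the two selections coincide.

```python
import pandas as pd

x = pd.DataFrame({"a": [4, 2, 3, 4, 5, 6], "b": [3, 5, 14, 2, 421, 12]})

# Assumption for illustration: give the frame a non-default index so that
# labels and positions no longer line up.
relabeled = x.set_index(pd.Index([5, 4, 3, 2, 1, 0]))

# Positional selection: rows at positions 1, 3 and 5.
by_position = relabeled.iloc[[1, 3, 5], :]

# Label-based selection: rows whose index labels are 1, 3 and 5.
by_label = relabeled.loc[[1, 3, 5], :]

print(by_position["a"].tolist())
print(by_label["a"].tolist())
```

Only the positional variant is what _safe_indexing needs, since it must not depend on whatever index the user's dataframe happens to carry.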

MarcoGorelli · 2025-03-24T16:50:11Z

Thanks for the ping!

Whether as a hard-dependency, soft-dependency, or vendored, I'd love to see this 😍

@thomasjpfan yup, numpy-style indexing is supported on narwhals.DataFrame objects (but not on narwhals.LazyFrame, where row order is undefined):

In [8]: df = nw.from_native(x, eager_only=True)

In [9]: df
Out[9]:
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
|       a    b     |
|    0  4    3     |
|    1  2    5     |
|    2  3   14     |
|    3  4    2     |
|    4  5  421     |
|    5  6   12     |
└──────────────────┘

In [10]: df[[1,3,5]]
Out[10]:
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
|        a   b     |
|     1  2   5     |
|     3  4   2     |
|     5  6  12     |
└──────────────────┘

See DataFrame.__getitem__. PyShiny makes heavy use of it.

You may also be interested in the "What about the pandas Index?" page of the Narwhals documentation.

lorentzenchr · 2025-04-01T20:57:14Z

So currently, there is no critical voice, but still time to raise one. Please do so.

Let's see how contributors would prefer to add narwhals:

🚀 add narwhals as dependency like joblib
🎉 vendor narwhals within scikit-learn like liblinear or array-api-compat (see MNT co-vendor array-api-{compat, extra} #30340)

Please cast a vote by selecting one of the two emojis as a reaction to this comment.
Disclaimer: While still kind of informal, this will shed light on how to proceed.

lesteve · 2025-04-02T04:56:11Z

OK, nobody was bold enough to raise concerns yet, so let me be the first one to do it 😅.

Not being very familiar with the space, I would need more information to first form an opinion and eventually try to reach a decision on this.

I have to admit my default reaction is conservatism. Adding a new dependency in scikit-learn has always been done with careful consideration about the trade-offs involved.

One of my worries: I have to admit that from afar this ecosystem (to simplify, let's say narwhals + polars) moves very fast (which, you know, can also be a good thing 😉) and there seems to be an impedance mismatch with how fast the core Scientific Python ecosystem is moving.

What does it bring us?

This has probably been discussed in other issues but it would be nice to have it summarized in this issue where we are trying to make a decision.

What does depending on narwhals bring us? Is it mostly to do pandas + polars conveniently? Are there any issues with our pandas + polars implementation that depending on narwhals would fix?
Are there many users who care about support for a dataframe library other than pandas or polars?
What kind of workflows would depending on narwhals unlock for you? What kind of work-around do you currently use?

How will our code change? aka what does it cost us?

How much of our code will need to change? Is it mostly a few (5-10) key functions?
What is your plan for the future? Would the switch to narwhals be done in a single PR? Is it going to be a lengthy per-function/estimator process like array API support (hopefully not 😅)?
Not sure how much work this is, but it would help a lot if your WIP PR ENH add narwhals as dependency #31127 changed our code in a way that enables something that wasn't possible before. This would help us get a feeling for how much change is needed on our side to switch to narwhals.

I guess linking to pull requests from other projects that switched to narwhals would be useful as well.

Discussion of the risks involved

Here are a few things off the top of my head, it would be nice to have these aspects detailed as well in the cost-benefit analysis:

narwhals stops being maintained and we end up having to maintain a fork. I guess unlikely in the short-term but you know it's hard to make predictions, especially about the future 🔮.
intersection with array API? narwhals can support Dask for example, and with array-api-extra the array API also supports Dask. How do the two interact in our code base? Any caveats there?
any consideration about the lazy aspects, any insights about supporting it, is it out of scope for now?

GaelVaroquaux · 2025-04-02T05:57:35Z

The way that I had understood this proposal was: optional dependency not needed for pandas support, but needed for other dataframe support. I'm not against that proposal :). But I was thinking: there will be an element of surprise, of "discomfort", which is that when users pass in a polars dataframe, they will get an error and will need to install narwhals to fix this error. So my question is: what does, for polars support, narwhals gain us? We already have a decent support of polars.

StefanieSenger · 2025-04-02T07:18:01Z

fairlearn started working on introducing narwhals one month ago. Here's the issue if somebody wants to have a look: fairlearn/fairlearn#1522

This was preceded by a discovery process led by the narwhals people, who are very helpful and quick! They are also very interested in spreading narwhals as widely as possible. I am sure we could ask them for a similar exploration for scikit-learn before making a decision.

The large difference to scikit-learn is that fairlearn uses pandas extensively within the classes, whereas scikit-learn converts to numpy (or another array library) at the earliest moment.

I agree with the more concerned voices who want to explore the extent first and understand the gains:

There is a large interruption between input and output already (in terms of using types from dataframe libraries), so where would narwhals then be beneficial? For converting the outputs into the same type as the inputs, if the user uses set_output()? Is that gain large enough compared to individual pandas/polars/ etc. conversions?

narwhals also supports more than pandas and polars, for instance pyarrow, which has some efficiency advantages, but not directly within scikit-learn (since array libraries are used). Is there any harm in converting the output afterwards, if users want to use it? I cannot see any. Users know what they want and don't need to write complicated conditions covering every case when they use an estimator's predictions or a transformer's output.

thomasjpfan · 2025-04-02T14:01:52Z

> what does, for polars support, narwhals gain us? We already have a decent support of polars.

The main benefit I see is to have one code path that supports both pandas and polars. And open the door to toggle on other dataframe libraries.

It would be great to use their dataframe-first expression API, but that brings us too far from our current way of doing things. Status quo is:

1. ndarray -> compute on ndarray
2. array-api array -> compute on array-api array
3. dataframe -> convert to ndarray -> compute on ndarray

If we add a dataframe-first expression API for compute, it'll add a third code path, which I do not really like:

4. dataframe -> compute using dataframe expression API

(Although 4. means we can have lazy dataframe support)
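
A rough sketch of the status quo for dataframes: convert to an ndarray first, then share the ndarray computation. The helper column_means is hypothetical and only stands in for an estimator's computation; it is not scikit-learn code.

```python
import numpy as np
import pandas as pd


def column_means(X):
    """Hypothetical stand-in for an estimator's computation."""
    # Dataframe input: convert to an ndarray as early as possible.
    # Both pandas and polars dataframes expose a to_numpy() method.
    if hasattr(X, "to_numpy"):
        X = X.to_numpy()
    # From here on, ndarray and dataframe inputs share one code path.
    X = np.asarray(X, dtype=float)
    return X.mean(axis=0)


from_frame = column_means(pd.DataFrame({"a": [1, 3], "b": [2, 4]}))
from_array = column_means(np.array([[1, 2], [3, 4]]))
print(from_frame, from_array)
```

A dataframe-first expression API would replace the body of such a function with dataframe operations, which is what would make it a separate code path to maintain.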

That being said, I think there is a net win for vendoring narwhals to more easily support polars & pandas at the same time.

> what is your plan for the future? Would the switch to narwhals be done in a single PR? Is it going to be a lengthy per-function/estimator process like array API support (hopefully not 😅)?

For me, I would:

Vendor narwhals and use v1 of the narwhals spec. (or v2 when they release it)
I suspect most of the changes are in https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_set_output.py and in _safe_indexing (scikit-learn/sklearn/utils/_indexing.py, line 180 in 812ff67):

def _safe_indexing(X, indices, *, axis=0):

There are a few places where we use the dataframe interchange protocol, which will most likely need to be updated to use narwhals.

From a user's point of view, the narwhals code change would be invisible and they would get the same feature set.

MarcoGorelli · 2025-04-24T11:00:03Z

Thanks all for the discussion! Just wanted to comment on

> Vendor narwhals and use v1 of the narwhals spec. (or v2 when they release it)

I'd suggest either using stable.v1 or vendoring; doing both is probably unnecessary.

The idea behind stable.v1 is: "code you write today will keep working in all future versions of Narwhals". It guards against deprecations and breaking changes. If you're vendoring, then there's no risk of deprecations / breaking changes anyway 😄

DeaMariaLeon · 2025-04-29T09:36:56Z

For people who couldn't attend yesterday's call (04/29/25):

Narwhals runs all the downstream libraries' tests. This ensures that the projects which have adopted it don't break. So, scikit-learn's tests would be run as well.

lorentzenchr added the RFC label Mar 21, 2025

adrinjalali mentioned this issue Mar 21, 2025

Allow column names to pass through when fitting narwhals dataframes #31019

Closed

MarcoGorelli mentioned this issue Mar 27, 2025

fix(python): Give priority to pycapsule interface in from_dataframe pola-rs/polars#21377

Merged

lorentzenchr linked a pull request Apr 1, 2025 that will close this issue

ENH add narwhals as dependency #31127

Open

lorentzenchr mentioned this issue Apr 13, 2025

Research: how would Narwhals work in scikit-learn? narwhals-dev/narwhals#355

Open

jeremiedbb mentioned this issue Apr 17, 2025

FIX _safe_indexing for pyarrow #31040

Merged
