Allow column names to pass through when fitting narwhals dataframes #31019

Closed
ryansheabla opened this issue Mar 18, 2025 · 4 comments
Labels: Needs Triage, New Feature

Comments

@ryansheabla

Describe the workflow you want to enable

Currently when fitting with a narwhals DataFrame, the feature names do not pass through because it does not implement a __dataframe__ method.

Example:

import narwhals as nw
import pandas as pd
import polars as pl
from sklearn.preprocessing import StandardScaler

df_pd = pd.DataFrame({"a": [0, 1, 2], "b": [3, 4, 5]})
df_pl = pl.DataFrame(df_pd)
df_nw = nw.from_native(df_pd)

s_pd, s_pl, s_nw = StandardScaler(), StandardScaler(), StandardScaler()
s_pd.fit(df_pd)
s_pl.fit(df_pl)
s_nw.fit(df_nw)

print(s_pd.feature_names_in_)
print(s_pl.feature_names_in_)
print(s_nw.feature_names_in_)

Expected output

['a' 'b']
['a' 'b']
['a' 'b']

Actual output

['a' 'b']
['a' 'b']
AttributeError: 'StandardScaler' object has no attribute 'feature_names_in_'

All other attributes on s_nw are what I'd expect.

Describe your proposed solution

This should be easy enough to implement by adding another check within sklearn.utils.validation._get_feature_names:

  1. Add an _is_narwhals_df function, borrowing logic from _is_pandas_df:

     def _is_narwhals_df(X):
         """Return True if X is a narwhals DataFrame."""
         try:
             nw = sys.modules["narwhals"]
         except KeyError:
             return False
         return isinstance(X, nw.DataFrame)

  2. Add an additional check to _get_feature_names:

     elif _is_narwhals_df(X):
         feature_names = np.asarray(X.columns, dtype=object)

Describe alternatives you've considered, if relevant

No response

Additional context

narwhals-dev/narwhals#355 (comment)

@ryansheabla added the Needs Triage and New Feature labels Mar 18, 2025
@StefanieSenger
Contributor

StefanieSenger commented Mar 20, 2025

Hi @ryansheabla,

scikit-learn accepts pandas or polars dataframes (only in the most recent versions), but narwhals dataframes cannot be passed directly: you need to call .to_native() on them first (if they are natively pandas or polars).

I think narwhals is to be used internally by libraries and users then could reliably pass polars, pandas or pyarrow frames.

This said, scikit-learn does not use dataframes much internally. Input validation, especially check_array(), converts user inputs into numpy arrays, and some information like feature_names_in_ is stored as attributes on the objects and passed to the next object if several operations are lined up. Also, the outputs of predict methods and transformers are numpy arrays by default, and users can set a different output type using set_output().

So the design - if scikit-learn decided to use narwhals, which has not been discussed yet - would be something like: a) store attributes from the inputs and, if the user has requested a return type other than numpy with set_output(), b) store the input types and c) return predictions or transformed data in the same namespace.

Narwhals could be used in steps b) and c), but maybe in step a) it is not necessary to have it.
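
To make steps a) to c) concrete, here is a toy sketch in plain Python. It is not scikit-learn code and does not use narwhals: `ToyCenterer` and its `output` switch are invented for illustration (the switch is only loosely analogous to set_output()), mapping types stand in for dataframes, and nested lists stand in for numpy arrays.

```python
from collections import OrderedDict


class ToyCenterer:
    """Toy transformer that subtracts the column mean (illustration only)."""

    def __init__(self, output="default"):
        # Hypothetical switch: "default" returns nested lists,
        # "same" returns the container type that was seen during fit.
        self.output = output

    def fit(self, X):
        # a) store attributes derived from the input
        self.feature_names_in_ = list(X.keys())
        self.means_ = [sum(X[name]) / len(X[name]) for name in self.feature_names_in_]
        if self.output == "same":
            # b) remember the input container type
            self.input_type_ = type(X)
        return self

    def transform(self, X):
        columns = [
            [value - mean for value in X[name]]
            for name, mean in zip(self.feature_names_in_, self.means_)
        ]
        if self.output == "same":
            # c) rebuild the result in the remembered container type
            return self.input_type_(zip(self.feature_names_in_, columns))
        # Default: one list per row, standing in for a 2D numpy array.
        return [list(row) for row in zip(*columns)]


data = OrderedDict([("a", [0, 1, 2]), ("b", [3, 4, 5])])
default_out = ToyCenterer().fit(data).transform(data)
same_out = ToyCenterer(output="same").fit(data).transform(data)
```

In this sketch step a) never needs to know which dataframe library is in use beyond reading the column names, which matches the remark that a wrapper layer would mainly matter for steps b) and c).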

@MarcoGorelli
Contributor

Thanks @ryansheabla for opening the request, and Stefanie for your reply! 🙏

> I think narwhals is to be used internally by libraries and users then could reliably pass polars, pandas or pyarrow frames.

True, but I think here @ryansheabla is making a library which internally uses Narwhals and would like to pass that around.

Also, although Narwhals supports PyArrow tables, scikit-learn doesn't (see #25896 (comment)).

Even though Narwhals was originally designed with tool builders in mind, I've anecdotally been hearing from users who work with it directly as a friendly and unified interface to different engines.

I don't think it would be too much of a lift to generalise the Polars code in scikit-learn to also handle Narwhals input, given that the API is very similar. Happy to work towards this if you'd be open to it! 🙌

@ryansheabla
Author

From what I can tell, scikit-learn does not have a problem converting narwhals DataFrames to numpy arrays, given that my example runs except for the AttributeError. Stepping through the fit methods, it seems to follow the same path through check_array as both pandas (1.5.3) and polars (1.24.0) and has no problem converting to numpy datatypes (though I wouldn't be surprised if there were other issues on a more complex example).

I added the changes to the code I outlined above in my virtual environment and the code runs as expected. You can even pass pandas/polars frames to the narwhals-fitted scaler and vice-versa. I'd be willing to open a PR but this feels like it's part of a larger discussion, especially if there's a possibility of a).

@adrinjalali
Member

Closing as a duplicate of #31049 and #25896.

@adrinjalali closed this as not planned Mar 21, 2025