
Support other dataframes like polars and pyarrow not just pandas #25896


Open
lorentzenchr opened this issue Mar 17, 2023 · 39 comments

@lorentzenchr
Member

lorentzenchr commented Mar 17, 2023

Describe the workflow you want to enable

Currently, scikit-learn does not claim anywhere to support pyarrow or polars. And indeed,

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.compose import ColumnTransformer

X, y = load_iris(as_frame=True, return_X_y=True)
sepal_cols = ["sepal length (cm)", "sepal width (cm)"]
petal_cols = ["petal length (cm)", "petal width (cm)"]

preprocessor = ColumnTransformer(
    [
        ("scaler", StandardScaler(), sepal_cols),
        ("kbin", KBinsDiscretizer(encode="ordinal"), petal_cols),
    ],
    verbose_feature_names_out=False,
)

import polars as pl  # or import pyarrow as pa
X_pl = pl.from_pandas(X)  # or X_pa = pa.table(X)

preprocessor.fit_transform(X_pl)
# preprocessor.set_output(transform="pandas").fit_transform(X_pl)

errors with

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

During handling of the above exception, another exception occurred:

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Describe your proposed solution

scikit-learn should support those dataframes, maybe via the Python dataframe interchange protocol.

In that regard, a new option like set_output(transform="dataframe") would be nice.

Describe alternatives you've considered, if relevant

No response

Additional context

Some related discussion came up in #25813.

@lorentzenchr lorentzenchr added New Feature Needs Triage Issue requires triage labels Mar 17, 2023
@Vishal-sys-code

While scikit-learn does not currently support Polars or PyArrow dataframes out of the box, there are some possible workarounds for using these dataframes with scikit-learn.

One possible solution is to convert the Polars or PyArrow dataframe to a pandas dataframe before passing it to scikit-learn's ColumnTransformer. This can be done with the to_pandas() method in Polars or the pa.Table.to_pandas() method in PyArrow.

import polars as pl
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# Load data into a Polars dataframe
X_pl = pl.DataFrame({...})

# Convert Polars dataframe to Pandas dataframe
X_pd = X_pl.to_pandas()

# Create ColumnTransformer
preprocessor = ColumnTransformer(
    [
        ("scaler", StandardScaler(), ["sepal length (cm)", "sepal width (cm)"]),
    ]
)

# Fit and transform using ColumnTransformer
X_transformed = preprocessor.fit_transform(X_pd)
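
For PyArrow the workaround has the same shape; here is a minimal sketch with illustrative data, reusing the preprocessor defined above:

import pyarrow as pa

# Load data into a PyArrow table (illustrative columns)
X_pa = pa.table({"sepal length (cm)": [5.1, 4.9], "sepal width (cm)": [3.5, 3.0]})

# Convert the PyArrow table to a pandas dataframe
X_pd = X_pa.to_pandas()

# Fit and transform using the same ColumnTransformer
X_transformed = preprocessor.fit_transform(X_pd)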

Another possible solution would be to write a custom transformer that can directly handle Polars or PyArrow dataframes. Such a transformer needs to implement fit() and transform() (fit_transform() then comes from TransformerMixin) and should be compatible with scikit-learn's ColumnTransformer.

import polars as pl
from sklearn.base import BaseEstimator, TransformerMixin

class PolarsTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, pl_transformer):
        self.pl_transformer = pl_transformer

    def fit(self, X, y=None):
        # Fit the wrapped transformer once, on a Polars view of the data
        self.pl_transformer.fit(pl.from_pandas(X))
        return self

    def transform(self, X):
        # Convert pandas -> Polars, apply the already-fitted transformer,
        # then convert back to pandas for downstream steps
        X_pl = pl.from_pandas(X)
        X_transformed_pl = self.pl_transformer.transform(X_pl)
        return X_transformed_pl.to_pandas()

With this custom transformer, you can pass it directly to scikit-learn's ColumnTransformer:

import polars as pl
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# Load data into a Polars dataframe
X_pl = pl.DataFrame({...})

# Create PolarsTransformer
preprocessor = ColumnTransformer(
    [
        ("scaler", PolarsTransformer(StandardScaler()), ["sepal length (cm)", "sepal width (cm)"]),
    ]
)

# Fit and transform using ColumnTransformer
X_transformed = preprocessor.fit_transform(X_pl)

@adrinjalali
Member

We should definitely fix this. I'm not sure if @thomasjpfan already has plans for it.

@adrinjalali adrinjalali added RFC and removed Needs Triage Issue requires triage labels Mar 21, 2023
@betatim
Member

betatim commented Mar 21, 2023

I think it would make a lot of sense to support other popular data frames, especially if they support the data frame protocol.

I'm not sure if @thomasjpfan already has plans for it.

If people have plans to work on things like this, it would be great to share them before they start working on it. Seems like a good opportunity to get collaboration going.

@thomasjpfan
Member

thomasjpfan commented Mar 21, 2023

I see three features with dataframes + a default option.

TLDR: The engineering to get other DataFrames to work is doable. Implementation-wise, I prefer to lean as much as we can on the DataFrame exchange protocol.

1. Support other dataframes as input in ColumnTransformer

If we want to support Polars directly, we need to extend ColumnTransformer to recognize it. Although it's not too hard to add polars as an optional dependency, I'd prefer to use the dataframe exchange protocol to get the data out of the input DataFrame.

2. Support other dataframes for output in set_output

When designing set_output, I left the API open so that we can have the following API:

def construct_polars_df(data, columns, index):
    # ignore index since polars does not have an index
    return pl.from_numpy(data, columns=columns)

# API does not work now, but not hard to enable.
transformer.set_output(transform=construct_polars_df)

The above API would configure scikit-learn to output polars DataFrames. The other piece is to get check_array to work with polars dataframes, which currently has some issues: #25813 (comment). Note that even if we get polars to work in a pipeline, it will have to go through many copies, because the polars <-> NumPy round trips are not free. Pandas does not have this issue because it can be backed by a NumPy array through pandas's BlockManager.

3. Generic set_output(transform="dataframe")

Assuming this means "dataframe in -> dataframe out", I think it's best to enable this with the dataframe exchange protocol once data-apis/dataframe-api#42 is decided. As with the above, we'll need to update check_array to work with the exchange protocol. If we do not want to wait for data-apis/dataframe-api#42, we can have optional dependencies on the dataframe libraries.

Default option: Do not extend support for other DataFrames

Given that Pandas 2.0 DataFrames can be backed by arrow, Polars can now go from polars -> pandas with zero copy. As stated in #25896 (comment), one can convert the polars dataframe into a pandas one before passing it to ColumnTransformer. This gives us the option of "Do not extend support for other DataFrames and recommend converting DataFrames into pandas because the conversion is zero copy".
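
For illustration, a sketch of that zero-copy route (assuming polars' to_pandas with use_pyarrow_extension_array, which requests Arrow-backed pandas columns instead of NumPy copies; the data is a placeholder):

import polars as pl

X_pl = pl.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
# Arrow-backed pandas columns reuse the underlying Arrow buffers,
# avoiding a materialized NumPy copy of the data
X_pd = X_pl.to_pandas(use_pyarrow_extension_array=True)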

@jiawei-zhang-a
Contributor

@thomasjpfan Hello Thomas, has this been left open for further discussion? May I take it on?

@adrinjalali
Member

@jiawei-zhang-a this is far from a good first issue, and we need to discuss it further. I suggest starting with other, simpler issues. But I'm happy that you're looking to contribute here :)

@jiawei-zhang-a
Contributor

@adrinjalali Your words are greatly appreciated, and I am excited at the opportunity to contribute to the project. Thank you for your encouragement!

@glemaitre
Member

This gives us the option of "Do not extend support for other DataFrames and recommend converting DataFrames into pandas because the conversion is zero copy".

Or do we magically convert to pandas internally? If we have a full pipeline with a predictor at the end, then I don't find it too much of a hassle. If the Pipeline is itself used as a transformer, then we will be expected to output the same DataFrame type as what came in.

@lorentzenchr
Member Author

Until data-apis/dataframe-api#42 is decided, could we at least support the ones that implement __dataframe__ (quite a few already) by means of pandas.api.interchange.from_dataframe (pandas v1.5.0)? I would like to avoid users having to call X.to_pandas() themselves.

Or could we use https://github.com/apache/arrow-nanoarrow to support arrow arrays in general?
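
For concreteness, a minimal sketch of that __dataframe__ route (the data is a placeholder):

import polars as pl
from pandas.api.interchange import from_dataframe

X_pl = pl.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
# Any object implementing __dataframe__ can be converted like this,
# without the user calling X.to_pandas() themselves
X_pd = from_dataframe(X_pl)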

@alexander-beedie

alexander-beedie commented Mar 28, 2023

Until data-apis/dataframe-api#42 is decided, could we at least support the ones that implement __dataframe__ (quite a few already) by means of pandas.api.interchange.from_dataframe (pandas v1.5.0)? I would like to avoid users having to call X.to_pandas() themselves.

As an FYI, it looks like VegaFusion just took the interchange approach for Polars integration; consequently they got Vaex, pyarrow Tables, cuDF, and Polars working with the same update, which seems like good bang for the buck 🤔

https://vegafusion.io/posts/2023/2023-03-25_Release_1.1.0.html

@adrinjalali
Member

Now that we have more or less the infrastructure for it, we shouldn't be too shy about supporting these.

@betatim
Member

betatim commented Mar 30, 2023

@lorentzenchr do you have some example code or a link to something that shows how people use duckdb and scikit-learn today? A super quick Google search got me to https://duckdb.org/docs/api/python/overview.html#result-conversion, which is a bit too basic(?). I'd like to see what some real-world(ish) code looks like today.

@thomasjpfan
Member

For libraries that implement the dataframe exchange protocol, a workaround to support other DataFrame inputs in ColumnTransformer is to use a FunctionTransformer that converts the DataFrame into a pandas one:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
import polars as pl
from pandas.api.interchange import from_dataframe

X, y = load_iris(as_frame=True, return_X_y=True)
sepal_cols = ["sepal length (cm)", "sepal width (cm)"]
petal_cols = ["petal length (cm)", "petal width (cm)"]
X_pl = pl.from_pandas(X)

preprocessor = make_pipeline(
    FunctionTransformer(from_dataframe, feature_names_out="one-to-one"),
    ColumnTransformer(
        [
            ("scaler", StandardScaler(), sepal_cols),
            ("kbin", KBinsDiscretizer(encode="ordinal"), petal_cols),
        ],
        verbose_feature_names_out=False,
    ),
)
preprocessor.set_output(transform="pandas")
preprocessor.fit_transform(X_pl)

Or do we magically convert to pandas internally? If we have a full pipeline with a predictor at the end, then I don't find it too much of a hassle. If the Pipeline is itself used as a transformer, then we will be expected to output the same DataFrame type as what came in.

I opened #26115 as an implementation of this idea.

As an update, the Polars np.asarray(polars_df) issue was resolved: pola-rs/polars#7961. When the bug fix is released, Polars DataFrames will work out of the box with estimators that assume homogeneous float data. I opened a similar issue for PyArrow: apache/arrow#34886.
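
For illustration, a sketch of what that fix enables (the column names are arbitrary):

import numpy as np
import polars as pl

X_pl = pl.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
# With the fix, a homogeneous float frame converts to a plain 2D array,
# which is all many estimators need from their input
X_np = np.asarray(X_pl)  # shape (2, 2), dtype float64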

@betatim
Member

betatim commented Apr 11, 2023

I think supporting other dataframes via FunctionTransformer and the like feels very much like a clever hack. For the average user it is probably way too time-consuming to figure out that this is the way to make it work; it probably doesn't even cross their mind that it is possible. For me this means we should work on getting to "passing a foobar-dataframe just works".

@lorentzenchr
Member Author

Do you have some example code or a link to something that shows how people use duckdb and scikit-learn today?

They simply convert to pandas before passing the data to fit (I can write some SQL-like data prep example if you like). This means that they have to have pandas installed.

My personal summary:

  1. I'd like to make (or better, find a volunteer to make) data objects that support __dataframe__ magically work, or fail clearly if pandas is not installed.
  2. Further discuss other set_output options.
  3. I'm thinking a lot about an arrow native ML library...

@betatim
Member

betatim commented Apr 13, 2023

They simply convert to pandas before passing the data to fit (I can write some SQL-like data prep example if you like). This means that they have to have pandas installed.

Thanks. I wasn't sure if it was as simple as that or not. Don't think we need an example.

@davlee1972

davlee1972 commented Jun 7, 2023

Here are my thoughts, since I work with all the dataframe libraries above, plus Spark and other frameworks. I'll list the PROs only.

substrait.io plan
https://substrait.io/
If you can define a set of logical operations, the plan can be executed on any substrait-compatible dataframe / engine.
This currently includes R, Presto, Spark, Clickhouse and PyArrow. You get native dataframe execution as the list of substrait-supported dataframes / engines grows.
This also lets developers code something in one library and execute it in production using another library that is more robust and scalable.

custom transformer
With regards to pypolars: Polars supports a lazy execution model which looks at all your transformations and optimizes them. Filters can all be moved to execute first, aggregations can be combined, etc. This requires everything to execute in Polars without converting back and forth to pandas.
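
For illustration, a minimal sketch of that lazy model (the file name and columns are hypothetical):

import polars as pl

# Build a lazy query; scan_csv reads nothing until collect() runs the plan
lazy = (
    pl.scan_csv("data.csv")
    .with_columns((pl.col("x") * 2).alias("x2"))
    .filter(pl.col("y") > 0)  # the optimizer can push this filter down
    .select(["x2", "g"])
)
result = lazy.collect()  # the optimized plan executes only here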

transformer
OK, here is a con: converting to pandas should be replaced with converting to Arrow instead. Pandas 2.0 has added support for pyarrow-backed columns vs numpy-backed columns. There are real issues with numpy-backed pandas, like variable-length string columns being stored in memory as dtype object instead of real strings, or integer columns not allowing NULLs. Pandas, Polars, R dataframes, DuckDB, etc. all already support Arrow under the hood for moving data in and out, which could be processed by scikit-learn.
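
A sketch of that Arrow-backed route (assuming pandas >= 2.0; the columns are illustrative):

import pandas as pd
import pyarrow as pa

# Nullable integers and real string storage survive the round trip
table = pa.table({"s": ["a", None, "c"], "i": [1, None, 3]})
# types_mapper=pd.ArrowDtype keeps the pandas columns Arrow-backed
df = table.to_pandas(types_mapper=pd.ArrowDtype)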

@ogrisel
Member

ogrisel commented Dec 8, 2023

I think most of the work is done for polars. But the ColumnTransformer (and maybe OrdinalEncoder and LabelEncoder) might still need work to support pyarrow properly.

We also would need .set_output(transform="pyarrow") in the transformer mixins.

@ogrisel
Member

ogrisel commented Dec 8, 2023

Maybe we could have one such issue per dataframe library we want to support, either for input only or input/output (e.g. at least pyarrow, I think).

@lorentzenchr
Member Author

FYI, the above code snippet now works, I guess since #26464. So I'm inclined to close.

@github-project-automation github-project-automation bot moved this from Discussion to Done in Dataframe interoperability Jan 10, 2025
@MarcoGorelli
Contributor

MarcoGorelli commented Mar 20, 2025

FYI, the above code snippet now works, I guess since #26464. So I'm inclined to close.

It works for Polars, but not for PyArrow, right?

At least:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.compose import ColumnTransformer

X, y = load_iris(as_frame=True, return_X_y=True)
sepal_cols = ["sepal length (cm)", "sepal width (cm)"]
petal_cols = ["petal length (cm)", "petal width (cm)"]

preprocessor = ColumnTransformer(
    [
        ("scaler", StandardScaler(), sepal_cols),
        ("kbin", KBinsDiscretizer(encode="ordinal"), petal_cols),
    ],
    verbose_feature_names_out=False,
)

import pyarrow as pa
X_pa = pa.table(X)

preprocessor.fit_transform(X_pa)

raises

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 21
     18 import pyarrow as pa
     19 X_pa = pa.table(X)
---> 21 preprocessor.fit_transform(X_pa)

File ~/scratch/.310venv/lib/python3.10/site-packages/sklearn/utils/_set_output.py:319, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    317 @wraps(f)
    318 def wrapped(self, X, *args, **kwargs):
--> 319     data_to_wrap = f(self, X, *args, **kwargs)
    320     if isinstance(data_to_wrap, tuple):
    321         # only wrap the first output for cross decomposition
    322         return_tuple = (
    323             _wrap_data_with_container(method, data_to_wrap[0], X, self),
    324             *data_to_wrap[1:],
    325         )

File ~/scratch/.310venv/lib/python3.10/site-packages/sklearn/base.py:1389, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1382     estimator._validate_params()
   1384 with config_context(
   1385     skip_parameter_validation=(
   1386         prefer_skip_nested_validation or global_skip_validation
   1387     )
   1388 ):
-> 1389     return fit_method(estimator, *args, **kwargs)

File ~/scratch/.310venv/lib/python3.10/site-packages/sklearn/compose/_column_transformer.py:1001, in ColumnTransformer.fit_transform(self, X, y, **params)
    998 else:
    999     routed_params = self._get_empty_routing()
-> 1001 result = self._call_func_on_transformers(
   1002     X,
   1003     y,
   1004     _fit_transform_one,
   1005     column_as_labels=False,
   1006     routed_params=routed_params,
   1007 )
   1009 if not result:
   1010     self._update_fitted_transformers([])

File ~/scratch/.310venv/lib/python3.10/site-packages/sklearn/compose/_column_transformer.py:902, in ColumnTransformer._call_func_on_transformers(self, X, y, func, column_as_labels, routed_params)
    897         else:  # func is _transform_one
    898             extra_args = {}
    899         jobs.append(
    900             delayed(func)(
    901                 transformer=clone(trans) if not fitted else trans,
--> 902                 X=_safe_indexing(X, columns, axis=1),
    903                 y=y,
    904                 weight=weight,
    905                 **extra_args,
    906                 params=routed_params[name],
    907             )
    908         )
    910     return Parallel(n_jobs=self.n_jobs)(jobs)
    912 except ValueError as e:

File ~/scratch/.310venv/lib/python3.10/site-packages/sklearn/utils/_indexing.py:270, in _safe_indexing(X, indices, axis)
    268     return _polars_indexing(X, indices, indices_dtype, axis=axis)
    269 elif hasattr(X, "shape"):
--> 270     return _array_indexing(X, indices, indices_dtype, axis=axis)
    271 else:
    272     return _list_indexing(X, indices, indices_dtype)

File ~/scratch/.310venv/lib/python3.10/site-packages/sklearn/utils/_indexing.py:36, in _array_indexing(array, key, key_dtype, axis)
     34 if isinstance(key, tuple):
     35     key = list(key)
---> 36 return array[key, ...] if axis == 0 else array[:, key]

File ~/scratch/.310venv/lib/python3.10/site-packages/pyarrow/table.pxi:1693, in pyarrow.lib._Tabular.__getitem__()

File ~/scratch/.310venv/lib/python3.10/site-packages/pyarrow/table.pxi:1779, in pyarrow.lib._Tabular.column()

File ~/scratch/.310venv/lib/python3.10/site-packages/pyarrow/table.pxi:1725, in pyarrow.lib._Tabular._ensure_integer_index()

TypeError: Index must either be string or integer

Given that the original issue also mentioned PyArrow, may I suggest either reopening until PyArrow support is completed, or making a separate issue for PyArrow support?

Just to avoid ambiguity: I'm not requesting that PyArrow be required in scikit-learn (far from it!), but that pyarrow.Table be supported in the same way that polars.DataFrame is.

Related issue: #31019

@lorentzenchr
Member Author

@scikit-learn/core-devs Should we make pyarrow tables work within scikit-learn (without requiring it as dependency, just like pandas and polars)?

@adam2392
Member

Yes that would be great imo. I need to look into it more, but are there any major API incompatibilities?

@lorentzenchr
Member Author

are there any major API incompatibilities?

Not that I know of. API calls like fit(X, y) stay the same; we would just allow more kinds of objects X, y to be passed.

@adam2392
Member

I meant on the Arrow side, to operate internally in fit, predict, etc.

Some related discussion: #25450

Would we use the dataframe interchange protocol? https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html#stakeholders

@adrinjalali
Member

I'm not sure about the dataframe interchange protocol, really.

I'd need to see what @MarcoGorelli thinks about it. At some point, in order to support multiple dataframe-like objects, we'd be better off simply using narwhals.

@betatim
Member

betatim commented Mar 20, 2025

I think the dataframe interchange protocol, at least the one that is similar to the array API, is not going to get widespread adoption. At least that is my impression.

@lorentzenchr
Member Author

There are several different things to fix for an implementation:

  1. set_output
    This requires a PyArrowTablesAdapter in sklearn/utils/_set_output.py (see the sketch after this list). Here we could, sooner or later, think about using narwhals.
  2. feature_names_in_, see the release highlights for 1.0
    This is done in sklearn/utils/validation.py and makes use of the dataframe interchange protocol, which is supported by pyarrow as of version 11.0.0.
  3. Internal indexing tools in sklearn/utils/_indexing.py
    This is where the error reported in #25896 (comment) stems from, i.e. _safe_indexing. It currently uses a mix of the dataframe protocol and pandas & polars specific code. Here, too, narwhals could help.
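
For concreteness, a rough sketch of what such an adapter could look like. The class name comes from item 1 above; the method names are assumptions modeled on the existing pandas/polars adapters, not scikit-learn's actual internal API:

import numpy as np
import pyarrow as pa

class PyArrowTablesAdapter:
    # Hypothetical sketch; the real adapter protocol may differ.
    container_lib = "pyarrow"

    def is_supported_container(self, X):
        return isinstance(X, pa.Table)

    def create_container(self, X_output, X_original, columns, inplace=False):
        # pyarrow tables are immutable, so `inplace` is ignored here
        data = np.asarray(X_output)
        return pa.table({name: data[:, i] for i, name in enumerate(columns)})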

@davlee1972

You could go full pyarrow with pyarrow dataset instead of pyarrow table.

Leveraging pyarrow compute to apply calculations is pretty powerful when backed by GPUs.

@ogrisel
Member

ogrisel commented Mar 21, 2025

Since the dataframe interchange API is unlikely to become widely adopted and feature-rich enough for scikit-learn's use cases, I wouldn't mind considering the inclusion of narwhals as a soft dependency to simplify support for polars / pyarrow tables in the future.

I would still keep custom code to support pandas without narwhals in the short to medium term, though, to avoid introducing a new dependency for pandas users.
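
For illustration, a tiny sketch of the narwhals approach (the helper name is hypothetical):

import narwhals as nw

def feature_names(df_native):
    # narwhals wraps pandas, polars and pyarrow objects behind one API,
    # so column names can be read without per-library branching
    return nw.from_native(df_native).columns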

@adrinjalali
Member

I'd be okay with adding narwhals as a dependency since it's very lightweight and doesn't bring in any transitive dependencies. However, I don't mind having two paths for now, one for pandas and one for the others, as long as we do NOT actively maintain the pandas path: leave it as is for now and mostly maintain the narwhals path.

@YuanfengZhang

YuanfengZhang commented Apr 20, 2025

PyArrow is used by pandas, polars and cudf (RAPIDS) alike, making it a good choice of interface for scikit-learn.
How about trying the pyarrow approach and falling back to numpy if it fails?
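
A minimal sketch of that fallback idea (the helper name is hypothetical):

import numpy as np

def coerce_input(X):
    # Hypothetical: prefer an Arrow table, fall back to a NumPy array
    try:
        import pyarrow as pa
        return pa.table(X)
    except Exception:
        return np.asarray(X)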

Importing narwhals is better than reinventing the wheel in the short term, but an additional dependency may sometimes cause trouble.
