Support other dataframes like polars and pyarrow not just pandas #25896
While scikit-learn does not currently support Polars or PyArrow dataframes out of the box, there are some possible workarounds to use these dataframes with scikit-learn. One possible solution would be to convert the Polars or PyArrow dataframe to a pandas dataframe before passing it to scikit-learn's `ColumnTransformer`:

```python
import polars as pl
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# Load data into a Polars dataframe
X_pl = pl.DataFrame({...})

# Convert the Polars dataframe to a pandas dataframe
X_pd = X_pl.to_pandas()

# Create a ColumnTransformer
preprocessor = ColumnTransformer(
    [
        ("scaler", StandardScaler(), ["sepal length (cm)", "sepal width (cm)"]),
    ]
)

# Fit and transform using the ColumnTransformer
X_transformed = preprocessor.fit_transform(X_pd)
```

Another possible solution would be to write a custom transformer that can directly handle Polars or PyArrow dataframes. This transformer would need to implement the `fit` and `transform` methods:
```python
import polars as pl
from sklearn.base import BaseEstimator, TransformerMixin


class PolarsTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, pl_transformer):
        self.pl_transformer = pl_transformer

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Convert the incoming pandas dataframe to Polars
        X_pl = pl.from_pandas(X)
        # Apply the wrapped transformer on the Polars dataframe
        X_transformed_pl = self.pl_transformer.fit_transform(X_pl)
        # Convert the result back to pandas for downstream steps
        X_transformed_pd = X_transformed_pl.to_pandas()
        return X_transformed_pd
```

With this custom transformer, you can pass it directly to scikit-learn's `ColumnTransformer`:
```python
import polars as pl
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# Load data into a Polars dataframe
X_pl = pl.DataFrame({...})

# Create a ColumnTransformer using the PolarsTransformer
preprocessor = ColumnTransformer(
    [
        ("scaler", PolarsTransformer(StandardScaler()), ["sepal length (cm)", "sepal width (cm)"]),
    ]
)

# Fit and transform using the ColumnTransformer
X_transformed = preprocessor.fit_transform(X_pl)
```

---
We definitely should fix this; I'm not sure if @thomasjpfan already has plans for it.

---
I think it would make a lot of sense to support other popular data frames, especially if they support the data frame protocol.
If people have plans to work on things like this, it would be great to share them before they start working on it. Seems like a good opportunity to get collaboration going.

---
I see three features with dataframes + a default option. TLDR: The engineering to get other DataFrames to work is doable. Implementation-wise, I prefer to lean as much as we can on the DataFrame exchange protocol to support other dataframes.
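For context, a minimal sketch of what leaning on the DataFrame exchange protocol can look like; this assumes pandas >= 1.5, which ships `pandas.api.interchange.from_dataframe`:

```python
import polars as pl
from pandas.api.interchange import from_dataframe

X_pl = pl.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Any dataframe exposing __dataframe__() can be converted to pandas,
# regardless of which library produced it.
if hasattr(X_pl, "__dataframe__"):
    X_pd = from_dataframe(X_pl)
```

---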
@thomasjpfan Hello Thomas, has this matter been left open for further discussion? May I take it?

---
@jiawei-zhang-a this is by far not a good first issue, and we need to discuss it further. I suggest starting with other, simpler issues. But happy that you're looking to contribute here :)

---
@adrinjalali Your words are greatly appreciated, and I am excited at the opportunity to contribute to the project. Thank you for your encouragement!

---
Or do we magically convert internally to pandas? If we have a full pipeline with a predictor at the end, then I don't find it too much of a hassle. If we have a …

---
Until data-apis/dataframe-api#42 is decided, could we at least support the ones with …? Or could we use https://github.com/apache/arrow-nanoarrow to support arrow arrays in general?

---
As an FYI, it looks like VegaFusion just took the interchange approach for Polars integration; consequently they got Vaex, pyarrow Tables, cuDF, and Polars working with the same update, which seems like good bang for the buck 🤔 https://vegafusion.io/posts/2023/2023-03-25_Release_1.1.0.html

---
Now that we have more or less the infrastructure for it, we shouldn't be too shy about supporting these.

---
@lorentzenchr do you have some example code or a link to something that shows how people use duckdb and scikit-learn now? A super quick Google search got me to https://duckdb.org/docs/api/python/overview.html#result-conversion, which is a bit too basic(?). I'd like to see what some real-world(ish) code looks like today.

---
For libraries that implement the dataframe exchange protocol, a workaround to support other DataFrame input is to convert to pandas with `from_dataframe` in a `FunctionTransformer`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
import polars as pl
from pandas.api.interchange import from_dataframe

X, y = load_iris(as_frame=True, return_X_y=True)
sepal_cols = ["sepal length (cm)", "sepal width (cm)"]
petal_cols = ["petal length (cm)", "petal width (cm)"]

X_pl = pl.from_pandas(X)

preprocessor = make_pipeline(
    FunctionTransformer(from_dataframe, feature_names_out="one-to-one"),
    ColumnTransformer(
        [
            ("scaler", StandardScaler(), sepal_cols),
            ("kbin", KBinsDiscretizer(encode="ordinal"), petal_cols),
        ],
        verbose_feature_names_out=False,
    ),
)
preprocessor.set_output(transform="pandas")
preprocessor.fit_transform(X_pl)
```

I opened #26115 as an implementation of this idea. As an update, the Polars …

---
I think supporting other dataframes via …

---
They simply convert to pandas before passing the data to scikit-learn. My personal summary:
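To make the duckdb-to-pandas hand-off above concrete, here is a minimal sketch (assuming duckdb's Python API, where a relation's `.df()` materializes the query result as a pandas DataFrame):

```python
import duckdb
from sklearn.linear_model import LinearRegression

# Run a query with duckdb and materialize the result as pandas
rel = duckdb.sql("SELECT 1.0 AS x, 2.0 AS y UNION ALL SELECT 2.0 AS x, 4.0 AS y")
df = rel.df()

# The pandas dataframe is then fed to scikit-learn as usual
model = LinearRegression().fit(df[["x"]], df["y"])
```

---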
Thanks. I wasn't sure whether it was as simple as that or not. I don't think we need an example.

---
Here are my thoughts, since I work with all the dataframe libraries above, Spark, and other frameworks. I'll just list the PROs only:

- substrait.io plan
- custom transformer

---
I think most of the work is done for polars. But the … We also would need …

---
Maybe we could have one such issue per dataframe library we want to support, either for input only or for input/output (e.g. at least pyarrow, I think).

---
FYI, the above code snippet now works, I guess since #26464. So I'm inclined to close.

---
It works for Polars, but not for PyArrow, right? At least:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.compose import ColumnTransformer

X, y = load_iris(as_frame=True, return_X_y=True)
sepal_cols = ["sepal length (cm)", "sepal width (cm)"]
petal_cols = ["petal length (cm)", "petal width (cm)"]

preprocessor = ColumnTransformer(
    [
        ("scaler", StandardScaler(), sepal_cols),
        ("kbin", KBinsDiscretizer(encode="ordinal"), petal_cols),
    ],
    verbose_feature_names_out=False,
)

import pyarrow as pa

X_pa = pa.table(X)

preprocessor.fit_transform(X_pa)
```

raises

```
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[2], line 21
18 import pyarrow as pa
19 X_pa = pa.table(X)
---> 21 preprocessor.fit_transform(X_pa)
File ~/scratch/.310venv/lib/python3.10/site-packages/sklearn/utils/_set_output.py:319, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
317 @wraps(f)
318 def wrapped(self, X, *args, **kwargs):
--> 319 data_to_wrap = f(self, X, *args, **kwargs)
320 if isinstance(data_to_wrap, tuple):
321 # only wrap the first output for cross decomposition
322 return_tuple = (
323 _wrap_data_with_container(method, data_to_wrap[0], X, self),
324 *data_to_wrap[1:],
325 )
File ~/scratch/.310venv/lib/python3.10/site-packages/sklearn/base.py:1389, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1382 estimator._validate_params()
1384 with config_context(
1385 skip_parameter_validation=(
1386 prefer_skip_nested_validation or global_skip_validation
1387 )
1388 ):
-> 1389 return fit_method(estimator, *args, **kwargs)
File ~/scratch/.310venv/lib/python3.10/site-packages/sklearn/compose/_column_transformer.py:1001, in ColumnTransformer.fit_transform(self, X, y, **params)
998 else:
999 routed_params = self._get_empty_routing()
-> 1001 result = self._call_func_on_transformers(
1002 X,
1003 y,
1004 _fit_transform_one,
1005 column_as_labels=False,
1006 routed_params=routed_params,
1007 )
1009 if not result:
1010 self._update_fitted_transformers([])
File ~/scratch/.310venv/lib/python3.10/site-packages/sklearn/compose/_column_transformer.py:902, in ColumnTransformer._call_func_on_transformers(self, X, y, func, column_as_labels, routed_params)
897 else: # func is _transform_one
898 extra_args = {}
899 jobs.append(
900 delayed(func)(
901 transformer=clone(trans) if not fitted else trans,
--> 902 X=_safe_indexing(X, columns, axis=1),
903 y=y,
904 weight=weight,
905 **extra_args,
906 params=routed_params[name],
907 )
908 )
910 return Parallel(n_jobs=self.n_jobs)(jobs)
912 except ValueError as e:
File ~/scratch/.310venv/lib/python3.10/site-packages/sklearn/utils/_indexing.py:270, in _safe_indexing(X, indices, axis)
268 return _polars_indexing(X, indices, indices_dtype, axis=axis)
269 elif hasattr(X, "shape"):
--> 270 return _array_indexing(X, indices, indices_dtype, axis=axis)
271 else:
272 return _list_indexing(X, indices, indices_dtype)
File ~/scratch/.310venv/lib/python3.10/site-packages/sklearn/utils/_indexing.py:36, in _array_indexing(array, key, key_dtype, axis)
34 if isinstance(key, tuple):
35 key = list(key)
---> 36 return array[key, ...] if axis == 0 else array[:, key]
File ~/scratch/.310venv/lib/python3.10/site-packages/pyarrow/table.pxi:1693, in pyarrow.lib._Tabular.__getitem__()
File ~/scratch/.310venv/lib/python3.10/site-packages/pyarrow/table.pxi:1779, in pyarrow.lib._Tabular.column()
File ~/scratch/.310venv/lib/python3.10/site-packages/pyarrow/table.pxi:1725, in pyarrow.lib._Tabular._ensure_integer_index()
TypeError: Index must either be string or integer
```

Given that the original issue also mentioned PyArrow, may I suggest either reopening until PyArrow support is completed, or making a separate issue for PyArrow support? Just to avoid ambiguity: I'm not requesting that PyArrow be required in scikit-learn (far from it!), but that it be supported as input.

Related issue: #31019

---
@scikit-learn/core-devs Should we make pyarrow tables work within scikit-learn (without requiring it as a dependency, just like pandas and polars)?

---
Yes, that would be great IMO. I need to look into it more, but are there any major API incompatibilities?

---
Not that I know of. The API calls like …
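As a point of reference for the API question: the traceback above fails because `pyarrow.Table` does not support numpy-style `[:, key]` indexing. A minimal sketch of the equivalent selection calls, using only documented `pyarrow.Table` methods:

```python
import pyarrow as pa

tbl = pa.table({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

col = tbl.column("a")         # single column, by name or integer index
sub = tbl.select(["a", "b"])  # column subset, returns a new Table
rows = tbl.take([0, 2])       # row subset, by integer indices
```

---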
I meant on the Arrow side, to operate internally in fit, predict, etc. Some related discussion: #25450. Would we use the dataframe interchange protocol? https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html#stakeholders

---
I'm not sure about the dataframe interchange protocol, really. I'd need to see what @MarcoGorelli thinks about it. At some point, in order to support multiple dataframe-like objects, we'd better simply use narwhals.

---
I think the dataframe interchange protocol, at least the one that is similar to the array API, is not going to get widespread adoption. At least that is my impression.

---
There are several different things to fix for an implementation: …

---
You could go full pyarrow with a pyarrow dataset instead of a pyarrow table. Leveraging pyarrow compute to apply calculations is pretty powerful when backed by GPUs.
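For a sense of what "pyarrow compute" means here, a minimal sketch that standardizes a column with `pyarrow.compute` kernels (the column name is made up for illustration):

```python
import pyarrow as pa
import pyarrow.compute as pc

tbl = pa.table({"x": [1.0, 2.0, 3.0]})

# Standardize column "x" entirely with pyarrow.compute kernels
mean = pc.mean(tbl["x"])
std = pc.stddev(tbl["x"])
scaled = pc.divide(pc.subtract(tbl["x"], mean), std)
```

---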
Since the dataframe interchange API is unlikely to become widely adopted and feature-rich enough for scikit-learn use cases, I wouldn't mind considering the inclusion of narwhals. I would still keep custom code to support pandas without narwhals in the short to medium term, to avoid introducing a new dependency for the pandas users, though.

---
I'd be okay adding narwhals as a dependency since it's a very lightweight dependency and doesn't bring in any transitive dependencies. However, I don't mind having two paths for now, one for pandas and one for the others, while making sure we do NOT maintain the pandas path too much, and just leave it as is for now and mostly maintain the narwhals path.

---
PyArrow is used by pandas, polars, and cuDF (RAPIDS), making it a good choice of interface for scikit-learn. Importing narwhals is better than reinventing the wheel in the short term, but an additional dependency may sometimes cause trouble.
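To illustrate the narwhals option discussed above, a minimal sketch of dispatch-free column selection; it assumes `narwhals.from_native` and `to_native`, which work for pandas, Polars, and pyarrow tables alike:

```python
import narwhals as nw

def select_columns(df_native, columns):
    # Wrap whatever dataframe we were given (pandas, Polars, pyarrow, ...)
    df = nw.from_native(df_native, eager_only=True)
    # Select columns with a single code path, then unwrap to the original type
    return df.select(*columns).to_native()

# Example usage with Polars; the same call works for pandas and pyarrow.Table
import polars as pl
X_pl = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
print(select_columns(X_pl, ["a"]))
```

---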
Describe the workflow you want to enable

Currently, scikit-learn nowhere claims to support pyarrow or polars. And indeed, it errors with …

Describe your proposed solution

scikit-learn should support those dataframes, maybe via the Python dataframe interchange protocol. In that regard, a new option like `set_output(transform="dataframe")` would be nice.

Describe alternatives you've considered, if relevant
No response
Additional context
Some related discussion came up in #25813.
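As a follow-up to the proposed `set_output(transform="dataframe")` option: scikit-learn's set_output API did later gain a Polars variant. A minimal sketch, assuming scikit-learn >= 1.4, where `set_output(transform="polars")` is supported:

```python
import polars as pl
from sklearn.preprocessing import StandardScaler

X_pl = pl.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Request polars output from the transformer
scaler = StandardScaler().set_output(transform="polars")
X_out = scaler.fit_transform(X_pl)  # a polars DataFrame
```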