Closed
Describe the workflow you want to enable
Currently, scikit-learn does not claim to support pyarrow or polars dataframes anywhere. And indeed, the following snippet
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.compose import ColumnTransformer
X, y = load_iris(as_frame=True, return_X_y=True)
sepal_cols = ["sepal length (cm)", "sepal width (cm)"]
petal_cols = ["petal length (cm)", "petal width (cm)"]
preprocessor = ColumnTransformer(
    [
        ("scaler", StandardScaler(), sepal_cols),
        ("kbin", KBinsDiscretizer(encode="ordinal"), petal_cols),
    ],
    verbose_feature_names_out=False,
)
import polars as pl  # or import pyarrow as pa
X_pl = pl.from_pandas(X)  # or X_pa = pa.table(X)
preprocessor.fit_transform(X_pl)
# preprocessor.set_output(transform="pandas").fit_transform(X_pl)
errors with
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
During handling of the above exception, another exception occurred:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
Describe your proposed solution
scikit-learn should support those dataframes, maybe via the Python dataframe interchange protocol.
In that regard, a new option like set_output(transform="dataframe") would be nice.
Describe alternatives you've considered, if relevant
No response
Additional context
Some related discussion came up in #25813.