Support other dataframes like polars and pyarrow not just pandas #25896

Closed
@lorentzenchr

Describe the workflow you want to enable

Currently, scikit-learn does not claim anywhere to support pyarrow or polars dataframes. And indeed, the following snippet

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.compose import ColumnTransformer

X, y = load_iris(as_frame=True, return_X_y=True)
sepal_cols = ["sepal length (cm)", "sepal width (cm)"]
petal_cols = ["petal length (cm)", "petal width (cm)"]

preprocessor = ColumnTransformer(
    [
        ("scaler", StandardScaler(), sepal_cols),
        ("kbin", KBinsDiscretizer(encode="ordinal"), petal_cols),
    ],
    verbose_feature_names_out=False,
)

import polars as pl  # or import pyarrow as pa
X_pl = pl.from_pandas(X)  # or X_pa = pa.table(X)

preprocessor.fit_transform(X_pl)
# preprocessor.set_output(transform="pandas").fit_transform(X_pl)

fails with

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

During handling of the above exception, another exception occurred:

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Describe your proposed solution

scikit-learn should support those dataframes, possibly via the Python dataframe interchange protocol (the `__dataframe__` method).

In that regard, a new option like set_output(transform="dataframe") would be nice.
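
To make the idea concrete, here is a minimal sketch of how string column selection could be resolved through the interchange protocol instead of relying on a pandas `.columns` attribute. It is not scikit-learn code: `ToyFrame`, `ToyInterchangeFrame`, and `resolve_columns` are hypothetical names invented for illustration, and the toy classes implement only the `__dataframe__()`, `num_columns()`, and `column_names()` parts of the protocol so that the example runs without polars or pyarrow installed.

```python
from typing import Any, List, Sequence


class ToyInterchangeFrame:
    """Hypothetical stand-in for the object returned by ``__dataframe__()``."""

    def __init__(self, data: dict):
        self._data = data

    def num_columns(self) -> int:
        return len(self._data)

    def column_names(self) -> List[str]:
        return list(self._data)


class ToyFrame:
    """Hypothetical non-pandas dataframe that exposes ``__dataframe__``."""

    def __init__(self, data: dict):
        self._data = data

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
        return ToyInterchangeFrame(self._data)


def resolve_columns(X: Any, columns: Sequence[str]) -> List[int]:
    """Map string column names to positional indices via the protocol.

    Assumption: any object with ``__dataframe__`` is accepted, regardless of
    which library it comes from.
    """
    if not hasattr(X, "__dataframe__"):
        raise ValueError(
            "Specifying the columns using strings requires an object "
            "that implements __dataframe__"
        )
    names = list(X.__dataframe__().column_names())
    missing = [c for c in columns if c not in names]
    if missing:
        raise ValueError(f"Columns not found: {missing}")
    return [names.index(c) for c in columns]


# Toy data using the iris column names from the example above.
X_toy = ToyFrame(
    {
        "sepal length (cm)": [5.1, 4.9],
        "sepal width (cm)": [3.5, 3.0],
        "petal length (cm)": [1.4, 1.4],
        "petal width (cm)": [0.2, 0.2],
    }
)
petal_idx = resolve_columns(X_toy, ["petal length (cm)", "petal width (cm)"])
print(petal_idx)  # [2, 3]
```

With something along these lines, a selector such as the one in `ColumnTransformer` would only need the column names from the protocol object, so the same code path could in principle serve pandas, polars, and pyarrow inputs alike.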

Describe alternatives you've considered, if relevant

No response

Additional context

Some related discussion came up in #25813.
