Support nullable pandas dtypes in LabelBinarizer #25637

@tamargrey

Describe the workflow you want to enable

I would like to be able to pass the nullable pandas dtypes ("Int64", "Float64", "boolean") into sklearn's LabelBinarizer. Because these dtypes become object dtype when converted to NumPy arrays, we get ValueError: Unknown label type.

Repro with sklearn 1.2.1:

    import pandas as pd
    import pytest
    from sklearn.preprocessing import LabelBinarizer

    for dtype in ["Int64", "Float64", "boolean"]:
        y_true = pd.Series([1, 0, 0, 1, 0, 1, 1, 0, 1], dtype=dtype)

        lb = LabelBinarizer()

        # Fitting on nullable-dtype labels currently raises.
        with pytest.raises(ValueError, match="Unknown label type:"):
            lb.fit(y_true.unique())

Describe your proposed solution

We should get the same behavior as when int64, float64, and bool dtypes are used, which is no error:

    import pandas as pd
    from sklearn.preprocessing import LabelBinarizer

    for dtype in ["int64", "float64", "bool"]:
        y_true = pd.Series([1, 0, 0, 1, 0, 1, 1, 0, 1], dtype=dtype)

        lb = LabelBinarizer()

        lb.fit(y_true.unique())
        y_one_hot_true = lb.transform(y_true)

Describe alternatives you've considered, if relevant

Our current workaround is to convert the data to NumPy arrays with the corresponding non-nullable dtype (int64, float64, or bool) before passing it into LabelBinarizer.
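For reference, a minimal sketch of that workaround. The `NULLABLE_TO_NUMPY` mapping and the `to_numpy_labels` helper are hypothetical names used only for illustration and are not part of pandas or sklearn. It assumes the labels contain no missing values, since `pd.NA` has no int64 or bool equivalent.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical mapping from nullable pandas dtypes to their NumPy counterparts.
NULLABLE_TO_NUMPY = {"Int64": "int64", "Float64": "float64", "boolean": "bool"}


def to_numpy_labels(y: pd.Series) -> np.ndarray:
    """Convert a nullable-dtype Series to a plain NumPy array (hypothetical helper)."""
    target = NULLABLE_TO_NUMPY.get(str(y.dtype))
    if target is None:
        # Not one of the nullable dtypes above: use the default conversion.
        return y.to_numpy()
    if y.isna().any():
        # Assumption: labels must be complete before conversion.
        raise ValueError("labels contain missing values")
    return y.to_numpy(dtype=target)


shapes = {}
for dtype in NULLABLE_TO_NUMPY:
    y_true = pd.Series([1, 0, 0, 1, 0, 1, 1, 0, 1], dtype=dtype)
    y_np = to_numpy_labels(y_true)

    lb = LabelBinarizer()
    lb.fit(np.unique(y_np))
    shapes[dtype] = lb.transform(y_np).shape
```

This works, but every caller has to repeat the conversion, which is why handling the nullable dtypes inside LabelBinarizer would be preferable.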

Additional context

No response