Support pandas nullable dtypes for scoring metrics #25578

Closed
tamargrey opened this issue Feb 9, 2023 · 4 comments
Labels
Needs Triage Issue requires triage New Feature

Comments

@tamargrey

Describe the workflow you want to enable

I would like to be able to pass data with the nullable pandas dtypes (Int64, Float64, and boolean) into sklearn metrics such as matthews_corrcoef, accuracy_score, and f1_score (and more), even if the data does not contain any NaNs. Currently, doing so results in one of several errors:

  • If y_true and y_pred are both nullable types: ValueError: unknown is not supported
  • If only one of y_true or y_pred is nullable and the other is non-nullable: ValueError: Classification metrics can't handle a mix of unknown and binary [or multiclass] targets
  • Some metrics such as log_loss result in a different error when y_true is nullable: ValueError: Unknown label type: (0 1

Repro with sklearn 1.2.1 and pandas 1.5.3:

    import pandas as pd
    import pytest
    from sklearn import metrics

    for dtype in ["Int64", "Float64", "boolean"]:
        # Feature matrix, kept from the original repro (the metrics below do not use it)
        X = pd.DataFrame({"a": pd.Series([1, 2, 3, 4]),
                          "b": pd.Series([9, 8, 7, 6])})

        # Both y_true and y_predicted use a nullable dtype
        y_true = pd.Series([1, 0, 1, 0], dtype=dtype)
        y_predicted = pd.Series([1, 0, 1, 0], dtype=dtype)
        with pytest.raises(ValueError, match="unknown is not supported"):
            metrics.accuracy_score(y_true, y_predicted)

        # Only y_true uses a nullable dtype; y_predicted is plain float64
        y_predicted = pd.Series([1, 0, 1, 0], dtype="float64")
        with pytest.raises(ValueError, match="Classification metrics can't handle a mix of unknown and binary targets"):
            metrics.accuracy_score(y_true, y_predicted)

Describe your proposed solution

Sklearn should recognize the pandas nullable dtypes as valid target types for its scoring metrics, as it already does with the non-nullable dtypes.

Describe alternatives you've considered, if relevant

As this data doesn't have null values, we can convert to the non-nullable dtype before passing it to sklearn, but doing that makes it cumbersome to build software that leverages both the latest pandas dtypes and sklearn.
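For reference, a minimal sketch of that conversion workaround, assuming the targets contain no missing values. The names `NULLABLE_TO_NUMPY` and `to_non_nullable` are hypothetical helpers written for this example, not part of pandas or sklearn:

```python
import pandas as pd
from sklearn import metrics

# Hypothetical mapping from pandas nullable dtype names to NumPy dtype names.
NULLABLE_TO_NUMPY = {"Int64": "int64", "Float64": "float64", "boolean": "bool"}


def to_non_nullable(y: pd.Series) -> pd.Series:
    """Cast a nullable-dtype Series to its NumPy-backed equivalent."""
    name = str(y.dtype)
    if name not in NULLABLE_TO_NUMPY:
        # Already a non-nullable dtype (or one this sketch does not cover).
        return y
    if y.isna().any():
        # The cast is only safe when there is nothing missing to lose.
        raise ValueError("cannot convert a Series that contains missing values")
    return y.astype(NULLABLE_TO_NUMPY[name])


y_true = pd.Series([1, 0, 1, 0], dtype="Int64")
y_pred = pd.Series([1, 0, 1, 0], dtype="Int64")

# Convert both targets before handing them to the metric.
score = metrics.accuracy_score(to_non_nullable(y_true), to_non_nullable(y_pred))
```

Every call site that feeds targets into a metric has to remember this extra step, which is the cumbersome part.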

Additional context

No response

@thomasjpfan
Member

Thank you for opening this issue! With #25638 merged, nullable dtypes are supported and the provided snippet does not raise anymore.
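To check whether an installed sklearn already includes this change, a small probe along the following lines may help. It is only a sketch: `accepts_nullable_targets` is a hypothetical helper name, it exercises `accuracy_score` alone, and it treats any `ValueError` as "not supported":

```python
import pandas as pd
from sklearn import metrics


def accepts_nullable_targets(dtype: str) -> bool:
    """Return True if accuracy_score accepts targets of the given nullable dtype."""
    y_true = pd.Series([1, 0, 1, 0], dtype=dtype)
    y_pred = pd.Series([1, 0, 1, 0], dtype=dtype)
    try:
        metrics.accuracy_score(y_true, y_pred)
    except ValueError:
        # Versions without the change reject these targets with a ValueError.
        return False
    return True


# One result per nullable dtype mentioned in the issue.
results = {
    dtype: accepts_nullable_targets(dtype)
    for dtype in ["Int64", "Float64", "boolean"]
}
```

The outcome depends on the installed sklearn and pandas versions, so no expected output is shown here.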

@tamargrey
Author

Thank you for the quick work! Do you know if this will go out in the next patch release?

@thomasjpfan
Member

Currently, #25638 is targeted for v1.3, but I think we can consider it for a bug patch release.

WDYT @lorentzenchr @jeremiedbb ?

@lorentzenchr
Member

I'm not the release manager. If we do a further bugfix release, why not.
