Conversation

@jeromedockes
Contributor

Reference Issues/PRs

Closes #27482

What does this implement/fix? Explain your changes.

When a ColumnTransformer (without set_output(transform="pandas")) transforms pandas columns that contain pd.NA, it outputs arrays of dtype object that still contain pd.NA. This can cause subsequent estimators in a pipeline to fail, because check_array(dtype="numeric") raises a TypeError on such arrays.
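To illustrate the failure mode, here is a minimal sketch that does not call scikit-learn at all. It assumes that a plain NumPy float conversion is a fair stand-in for the numeric conversion done by check_array; the array `X_out` and the helper `try_numeric_conversion` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for a ColumnTransformer output:
# an object-dtype array that still holds pd.NA.
X_out = np.array([[1, pd.NA], [3, 4]], dtype=object)


def try_numeric_conversion(X):
    """Return the exception raised by a float conversion, or None on success.

    Hypothetical helper: np.asarray is used here as an approximation of
    the numeric conversion step, not as scikit-learn's actual code path.
    """
    try:
        np.asarray(X, dtype=np.float64)
    except (TypeError, ValueError) as exc:
        return exc
    return None


error = try_numeric_conversion(X_out)
print(type(error).__name__, error)
```

The conversion fails because pd.NA cannot be turned into a float, which is why a downstream estimator that expects numeric input breaks on such an array.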

#27671 tried to fix the problem by applying check_array to transformer outputs in order to convert pd.NA to np.nan, but this causes other issues when the outputs have mixed dtypes. It was therefore decided during the scikit-learn developers meeting that we should instead raise an error when the output of some transformer contains pd.NA and the output is not set to "pandas". This PR supersedes #27671.

Here we first check the dtypes of the output so that we only search for pd.NA in columns that have a chance of containing any.
At the moment this PR only searches for pd.NA in dataframes; I'm not sure whether we should also check numpy arrays (of dtype object) and sparse arrays and matrices.
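The dtype-first search described above could look roughly like the following sketch. It is not the code in this PR: the function name `columns_with_pd_na` is hypothetical, and the rule "plain NumPy dtypes other than object cannot store pd.NA" is the assumption used to decide which columns to skip.

```python
import numpy as np
import pandas as pd


def columns_with_pd_na(df):
    """Return the names of the columns of `df` that contain pd.NA.

    Hypothetical helper for illustration only.
    """
    found = []
    for name, dtype in df.dtypes.items():
        # Cheap dtype check first: a plain NumPy dtype such as float64
        # or int64 cannot hold pd.NA, so those columns are skipped.
        if isinstance(dtype, np.dtype) and dtype.kind != "O":
            continue
        # Only object and extension-dtype columns are scanned. The identity
        # test avoids counting np.nan or None as pd.NA.
        if any(value is pd.NA for value in df[name]):
            found.append(name)
    return found


df = pd.DataFrame(
    {
        "a": pd.array([1, None, 3], dtype="Int64"),  # extension dtype, holds pd.NA
        "b": [1.0, np.nan, 3.0],  # float64, np.nan only
        "c": pd.Series([1, pd.NA, "z"], dtype=object),  # object dtype, holds pd.NA
    }
)
print(columns_with_pd_na(df))
```

Checking dtypes before scanning values keeps the cost low for the common case where every output column is a plain numeric NumPy dtype.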

@github-actions

github-actions bot commented Nov 6, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: ee408e0. Link to the linter CI: here

Member

@ogrisel ogrisel left a comment

This looks good to me, but I do not understand why the following line is not covered by the CI. Could you please investigate what's wrong?

@jeromedockes jeromedockes marked this pull request as draft December 5, 2023 09:00
@jeromedockes
Contributor Author

jeromedockes commented Dec 5, 2023

TODO:

@jeromedockes jeromedockes marked this pull request as ready for review December 5, 2023 12:58
Member

@ogrisel ogrisel left a comment

LGTM. If merged on time this should be backported to 1.4.0 or 1.4.1.

@ogrisel ogrisel added this to the 1.4 milestone Jan 10, 2024
@jeremiedbb jeremiedbb added the "To backport" label (PR merged in master that need a backport to a release branch defined based on the milestone) Jan 15, 2024
@glemaitre glemaitre self-requested a review January 16, 2024 15:28
@glemaitre glemaitre changed the title from "Forbid pd.NA in ColumnTransformer output unless transform output is configured as "pandas"" to "API Forbid pd.NA in ColumnTransformer output unless transform output is configured as "pandas"" Jan 16, 2024
Member

@glemaitre glemaitre left a comment

LGTM. I pushed a small commit to order the changelog and address the comment from @ogrisel.

@glemaitre
Member

I enabled auto-merge.

@glemaitre glemaitre enabled auto-merge (squash) January 16, 2024 15:43
@glemaitre glemaitre merged commit 059de51 into scikit-learn:main Jan 16, 2024
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Jan 17, 2024
jeremiedbb pushed a commit that referenced this pull request Jan 17, 2024
…is configured as "pandas" (#27734)

Co-authored-by: Guillaume Lemaitre <[email protected]>
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Feb 10, 2024

Labels

module:compose
To backport (PR merged in master that need a backport to a release branch defined based on the milestone)

Development

Successfully merging this pull request may close these issues.

ColumnTransformer converts pandas extension datatypes to object

4 participants