API Forbid pd.NA in ColumnTransformer output unless transform output is configured as "pandas" #27734
ogrisel
left a comment
This looks good to me, but I do not understand why the following line is not covered by the CI. Could you please investigate what's wrong?
TODO:
This reverts commit 1313a0c.
ogrisel
left a comment
LGTM. If merged on time this should be backported to 1.4.0 or 1.4.1.
glemaitre
left a comment
LGTM. I pushed a small commit to reorder the changelog and address @ogrisel's comment.
I have enabled auto-merge.
…is configured as "pandas" (scikit-learn#27734) Co-authored-by: Guillaume Lemaitre <[email protected]>
Reference Issues/PRs
Closes #27482
What does this implement/fix? Explain your changes.
The ColumnTransformer (without set_output(transform="pandas")), when it transforms pandas columns that contain pd.NA, outputs arrays of dtype object that contain pd.NA, which can cause subsequent estimators in a pipeline to fail because check_array(dtype="numeric") raises a TypeError on such arrays.
#27671 looked into fixing the problem by converting pd.NA to np.nan by applying check_array on transformer outputs, but this causes other issues when the outputs have mixed dtypes, so it was decided during the scikit-learn developers meeting that instead we should raise an error when the output of some transformer contains pd.NA and the output is not set to "pandas". Therefore this PR supersedes #27671.
Here we first check the dtypes of the output to only search for pd.NA in columns that have a chance of containing any.
ATM this PR only searches for pd.NA in dataframes; I'm not sure if we should check numpy arrays (of dtype object) and sparse arrays and matrices as well.
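To illustrate the idea of checking dtypes first and only then scanning for pd.NA, here is a minimal sketch using only pandas and numpy. It is not the code from this PR: the helper names columns_that_may_hold_pd_na and frame_contains_pd_na are hypothetical, and the rule "object columns and pandas extension dtypes may hold pd.NA" is an assumption made for the example.

```python
import numpy as np
import pandas as pd


def columns_that_may_hold_pd_na(df):
    """Return the names of columns whose dtype could store pd.NA.

    Hypothetical helper (not part of scikit-learn). Assumption: plain NumPy
    numeric columns cannot hold pd.NA, so only object columns and pandas
    extension dtypes need to be scanned.
    """
    return [
        name
        for name, dtype in df.dtypes.items()
        if pd.api.types.is_object_dtype(dtype)
        or isinstance(dtype, pd.api.extensions.ExtensionDtype)
    ]


def frame_contains_pd_na(df):
    """Return True if any candidate column holds at least one pd.NA."""
    for name in columns_that_may_hold_pd_na(df):
        # Use an identity check: `value == pd.NA` evaluates to pd.NA,
        # which cannot be used as a boolean.
        if any(value is pd.NA for value in df[name]):
            return True
    return False


# Usage: a float column with np.nan is skipped by the dtype filter, while a
# nullable integer column and an object column are scanned.
df = pd.DataFrame(
    {
        "float_col": [1.0, np.nan],
        "nullable_int": pd.array([1, None], dtype="Int64"),
        "object_col": pd.Series(["x", "y"], dtype=object),
    }
)
print(columns_that_may_hold_pd_na(df))
print(frame_contains_pd_na(df))
```

A check along these lines could be the basis for raising an error when the output is not configured as "pandas"; the dtype filter keeps the scan cheap for frames that are mostly plain numeric columns.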