@jeromedockes
Contributor

Reference Issues/PRs

fixes #27482

What does this implement/fix? Explain your changes.

When a ColumnTransformer is used without set_output(transform="pandas") and transforms pandas columns that contain pd.NA, it outputs arrays of dtype object that still contain pd.NA. This can cause subsequent estimators in a pipeline to fail, because check_array(dtype="numeric") raises a TypeError on such arrays.
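A minimal sketch of the failure mode described above. To stay independent of the installed scikit-learn and pandas versions, the object-dtype array is built by hand as a stand-in for the ColumnTransformer output rather than produced by a real transformer. `numeric_check_fails` is a hypothetical helper, and since the exact exception type may differ between versions, both TypeError and ValueError are caught.

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array

# Hand-built stand-ins for stacked transformer outputs (assumption: not
# produced by an actual ColumnTransformer).
with_na = np.array([[0.5], [pd.NA]], dtype=object)
without_na = np.array([[0.5], [1.3]], dtype=object)


def numeric_check_fails(arr):
    """Return True if check_array(dtype="numeric") rejects `arr`."""
    try:
        check_array(arr, dtype="numeric")
    except (TypeError, ValueError):
        return True
    return False


print(numeric_check_fails(with_na))
print(numeric_check_fails(without_na))
```

The second array has the same object dtype but only holds plain floats, so the numeric conversion can succeed; the pd.NA entry is what breaks it.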

The proposal here is to apply check_array(dtype=None) to the transformers' outputs before performing the horizontal stacking. This converts pd.Float64 columns to np.float64 and pd.NA to np.nan.
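The intended effect of that conversion can be sketched with pandas alone. This is not the code of the PR: it swaps in an explicit `Series.to_numpy(dtype=..., na_value=...)` call for check_array, under the assumption that every column is numeric, and the example data is made up.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    {
        "A": pd.array([0.5, pd.NA], dtype="Float64"),
        "B": pd.array([1, 2], dtype="Int64"),
    }
)

# Convert each nullable column to a float64 column vector, mapping pd.NA
# to np.nan, before stacking horizontally.
blocks = [
    X[col].to_numpy(dtype="float64", na_value=np.nan).reshape(-1, 1)
    for col in X.columns
]
stacked = np.hstack(blocks)

print(stacked.dtype)
print(stacked)
```

After the conversion the stacked array is a plain float64 array with np.nan in place of pd.NA, which is what downstream numeric validation expects.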

Any other comments?

Another option is to forbid pd.NA in the individual transformers' output if the ColumnTransformer does not have its output transform set to "pandas". Before performing the hstack we can check for pd.NA and raise if any are found.
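A rough sketch of what such a check before the hstack could look like. `raise_if_pd_na` is a hypothetical helper written for illustration, not a scikit-learn function, and the error message is invented.

```python
import numpy as np
import pandas as pd


def raise_if_pd_na(outputs):
    """Raise ValueError if any transformer output contains pd.NA.

    Hypothetical helper: `outputs` is a list of DataFrames or arrays.
    """
    for position, out in enumerate(outputs):
        if isinstance(out, pd.DataFrame):
            columns = [out.iloc[:, i] for i in range(out.shape[1])]
        else:
            columns = [np.asarray(out, dtype=object).ravel()]
        # Identity check so that np.nan is still allowed; only pd.NA is rejected.
        if any(value is pd.NA for column in columns for value in column):
            raise ValueError(
                f"Output of transformer {position} contains pd.NA; "
                "use set_output(transform='pandas') or remove the missing values."
            )


clean = pd.DataFrame({"A": pd.array([0.5, 1.3], dtype="Float64")})
with_na = pd.DataFrame({"A": pd.array([0.5, pd.NA], dtype="Float64")})

raise_if_pd_na([clean, np.array([[1.0], [np.nan]])])  # passes silently

try:
    raise_if_pd_na([clean, with_na])
    raised = False
except ValueError:
    raised = True
print(raised)
```

Scanning every value is linear in the size of the outputs; a real implementation would likely restrict the check to columns with pandas extension dtypes.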

This alternative still has the drawback that the output for pd.Float64 columns without missing values will have dtype object rather than np.float64.

@github-actions

github-actions bot commented Oct 26, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4fc9422.

@glemaitre changed the title from "apply check_array on outputs of ColumnTransformer" to "FIX apply check_array before stacking in ColumnTransformer" on Oct 30, 2023
@glemaitre
Member

Could you add an entry to the changelog, just to make the CI happy :)

@jeromedockes jeromedockes marked this pull request as draft October 30, 2023 13:15
@jeromedockes
Contributor Author

jeromedockes commented Oct 30, 2023

Applying check_array to the outputs like this is not a solution: it fails on dataframes that mix numeric and string columns, because it tries to convert the strings to floats.

Using the current state of this PR:

import pandas as pd
from sklearn.compose import make_column_transformer

X = pd.DataFrame({"A": [0.5, 1.3], "B": ['a', 'b']}).convert_dtypes()
transformer = make_column_transformer(("passthrough", ["A", "B"]))
X2 = transformer.fit_transform(X)
# ValueError: could not convert string to float: 'a'

@jeromedockes
Contributor Author

In the developers meeting it was decided that, instead of this approach, we should raise an error when the transformers' outputs contain pd.NA and the output is not configured as "pandas". I am therefore closing this PR in favor of #27734.

Development

Successfully merging this pull request may close these issues.

ColumnTransformer converts pandas extension datatypes to object

2 participants