FIX apply check_array before stacking in ColumnTransformer #27671

jeromedockes · 2023-10-26T13:07:24Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

The ColumnTransformer (without set_output(transform="pandas")), when it transforms pandas columns that contain pd.NA, outputs arrays of dtype object that contain pd.NA, which can cause subsequent estimators in a pipeline to fail because check_array(dtype="numeric") raises a TypeError on such arrays.

The proposal here is to apply check_array(dtype=None) on the transformers' outputs before performing the horizontal stacking. This will convert pd.Float64 to np.float64 and pd.NA to np.nan.
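To illustrate the conversion described above, here is a minimal sketch that uses only pandas and NumPy. It does not call check_array or ColumnTransformer; the explicit to_numpy call is an assumption standing in for the conversion the PR aims for, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A nullable Float64 column holding a missing value (pd.NA).
col = pd.Series(pd.array([0.5, pd.NA, 1.3], dtype="Float64"))

# An object array that contains pd.NA, like the output described above,
# cannot be cast to a numeric dtype.
obj = np.array([0.5, pd.NA, 1.3], dtype=object)
try:
    obj.astype(float)
    cast_failed = False
except (TypeError, ValueError):
    cast_failed = True

# Explicit conversion: Float64 becomes np.float64 and pd.NA becomes np.nan.
converted = col.to_numpy(dtype="float64", na_value=np.nan)

print(cast_failed, converted.dtype)
```

The explicit dtype and na_value arguments keep the result independent of the default conversion rules, which have changed between pandas versions.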

Any other comments?

Another option is to forbid pd.NA in the individual transformers' output if the ColumnTransformer does not have its output transform set to "pandas". Before performing the hstack we can check for pd.NA and raise if any are found.

This still has the drawback that the output for pd.Float64 columns (without missing values) will be object rather than np.float64.
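The alternative of checking for pd.NA before the hstack could look roughly like the sketch below. The helper name raise_if_pd_na and its error message are hypothetical, not scikit-learn API, and the inputs are assumed to be transformer outputs that have already been converted to NumPy arrays.

```python
import numpy as np
import pandas as pd


def raise_if_pd_na(outputs):
    """Hypothetical helper: raise if any transformer output holds pd.NA."""
    for position, output in enumerate(outputs):
        values = np.asarray(output, dtype=object).ravel()
        # Identity check, so that np.nan and None are not reported.
        if any(value is pd.NA for value in values):
            raise ValueError(
                f"Output of transformer {position} contains pd.NA; "
                "configure the output as 'pandas' or impute first."
            )


clean = np.array([[0.5], [np.nan]], dtype=object)
with_na = np.array([[0.5], [pd.NA]], dtype=object)

raise_if_pd_na([clean])  # passes: np.nan is allowed
try:
    raise_if_pd_na([clean, with_na])
    raised = False
except ValueError:
    raised = True
```

The check uses `is pd.NA` rather than pd.isna so that ordinary np.nan values, which downstream estimators can handle, are not rejected.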

github-actions · 2023-10-26T13:08:56Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4fc9422. Link to the linter CI: here

sklearn/compose/tests/test_column_transformer.py

glemaitre · 2023-10-30T10:25:59Z

Could you add an entry to the changelog, just to make the CI happy :)

jeromedockes · 2023-10-30T13:37:24Z

Applying check_array like this to the outputs is not a solution, because it fails on dataframes that mix numeric and string columns: it tries to convert the strings to floats.

Using the current state of this PR:

import pandas as pd
from sklearn.compose import make_column_transformer

X = pd.DataFrame({"A": [0.5, 1.3], "B": ['a', 'b']}).convert_dtypes()
transformer = make_column_transformer(("passthrough", ["A", "B"]))
X2 = transformer.fit_transform(X)

ValueError: could not convert string to float: 'a'

jeromedockes · 2023-11-06T16:03:45Z

In the developers' meeting it was decided that, instead of this approach, we should raise an error when the transformers' outputs contain pd.NA and the output is not configured as "pandas". I am therefore closing this PR in favor of #27734.

apply check_array on outputs of ColumnTransformer

db0e993

github-actions bot added the module:compose label Oct 26, 2023

fix test for pandas versions without extension dtypes

71d1e06

glemaitre reviewed Oct 30, 2023

sklearn/compose/tests/test_column_transformer.py

glemaitre changed the title ~~apply check_array on outputs of ColumnTransformer~~ FIX apply check_array before stacking in ColumnTransformer Oct 30, 2023

jeromedockes and others added 3 commits October 30, 2023 11:58

Update sklearn/compose/tests/test_column_transformer.py

bb64daa

Co-authored-by: Guillaume Lemaitre <[email protected]>

add whatsnew

0e6dbf9

Merge branch 'fix-27482' of github.com:jeromedockes/scikit-learn into…

4fc9422

… fix-27482

jeromedockes marked this pull request as draft October 30, 2023 13:15

jeromedockes mentioned this pull request Nov 6, 2023

API Forbid pd.NA in ColumnTransformer output unless transform output is configured as "pandas" #27734

Merged

jeromedockes closed this Nov 6, 2023
