
Conversation

FrancoisPgm
Contributor

Reference Issues/PRs

Deprecation cleanup for #29950.

What does this implement/fix? Explain your changes.

Remove the behaviour that made Imputers not drop empty features when strategy='constant', even when keep_empty_features is set to False. Now keep_empty_features=False makes the Imputer drop empty features in all cases.
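For context, an "empty feature" is a column in which every value is missing; with `keep_empty_features=False` such columns are dropped. A minimal plain-numpy sketch of that dropping semantics (an illustration, not scikit-learn's actual implementation):

```python
import numpy as np

X = np.array([
    [1.0, np.nan, 3.0],
    [4.0, np.nan, 6.0],
])

# A feature (column) is "empty" when every entry is missing.
empty = np.isnan(X).all(axis=0)

# keep_empty_features=False: drop the empty columns entirely.
X_dropped = X[:, ~empty]
print(X_dropped.shape)  # (2, 2)
```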

Any other comments?

I followed the indications left in the TODO(1.8) comments. However, in SimpleImputer._dense_fit the comment says to put np.nan into the statistic at the empty-feature positions so those features can be dropped later. The statistic is a numpy array whose dtype matches X, and np.nan is a float, so it cannot be inserted into integer arrays. As a result, a test is currently failing for integer arrays, and I'm not sure about the best strategy to get around this issue.
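The dtype clash is easy to reproduce in isolation; a small plain-numpy sketch, independent of the scikit-learn code:

```python
import numpy as np

# NaN is representable in float dtypes, so this works:
stats_float = np.full(3, 0.0)
stats_float[1] = np.nan

# ...but NaN has no integer representation, so this raises ValueError:
stats_int = np.full(3, 0, dtype=np.int64)
try:
    stats_int[1] = np.nan
except ValueError as exc:
    print("assignment failed:", exc)
```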


github-actions bot commented Sep 24, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: ec95191.

@adrinjalali
Member

CI failing here @FrancoisPgm

@FrancoisPgm
Contributor Author

> CI failing here @FrancoisPgm

Yes, the failure is related to what I mentioned in the comment: np.nan is a float and can't be inserted into an int array, so the deprecation approach described in the TODO comments does not work in all cases. An alternative way to mark the empty feature dimensions is needed. I figured I'd open the PR to have a place to discuss it.

Comment on lines 587 to 584
```diff
-    # TODO(1.8): Remove FutureWarning and add `np.nan` as a statistic
-    # for empty features to drop them later.
-    if not self.keep_empty_features and ma.getmask(masked_X).all(axis=0).any():
-        warnings.warn(
-            "Currently, when `keep_empty_feature=False` and "
-            '`strategy="constant"`, empty features are not dropped. '
-            "This behaviour will change in version 1.8. Set "
-            "`keep_empty_feature=True` to preserve this behaviour.",
-            FutureWarning,
-        )
-
     # for constant strategy, self.statistics_ is used to store
     # fill_value in each column
-    return np.full(X.shape[1], fill_value, dtype=X.dtype)
+    statistics = np.full(X.shape[1], fill_value, dtype=X.dtype)
+
+    if not self.keep_empty_features:
+        for i in range(masked_X.shape[1]):
+            if ma.getmask(masked_X[:, i]).all():
+                statistics[i] = np.nan
+
+    return statistics
```
Contributor Author

Here is the issue where np.nan is added to statistics.

```python
if not self.keep_empty_features:
    for i in range(missing_mask.shape[1]):
        if all(missing_mask[:, i].data):
            statistics[i] = np.nan
```
Member

The reason this works, but the dense case doesn't, is that here statistics is created as a float array, whereas there the array is created with the same dtype as the input, and then putting np.nan (which is a float) into the int array fails.

I think we should probably use an object dtype for statistics, and then convert the type before putting it into X when modifying it instead.
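A toy sketch of that object-dtype idea in plain numpy, using a hypothetical -1 sentinel for missing values (this is an assumption for illustration, not the actual SimpleImputer code): the statistics array holds Python ints alongside np.nan, and the fill values are cast back to X's dtype only at transform time.

```python
import numpy as np

X = np.array([[1, -1, 3], [4, -1, 6]], dtype=np.int64)  # -1 encodes "missing"
missing = X == -1
fill_value = 0

# "fit": object-dtype statistics can hold both ints and np.nan,
# so empty features can be marked with NaN even for integer X.
statistics = np.full(X.shape[1], fill_value, dtype=object)
statistics[missing.all(axis=0)] = np.nan

# "transform": drop columns whose statistic is NaN, then cast the
# remaining fill values back to X's dtype before inserting them.
keep = np.array(
    [not (isinstance(s, float) and np.isnan(s)) for s in statistics]
)
fill = statistics[keep].astype(X.dtype)
Xt = np.where(missing[:, keep], fill, X[:, keep])
print(Xt.dtype, Xt.shape)
```

The point of the design is that the int-vs-NaN clash only ever existed inside the statistics array; once it is object-typed, the cast happens per column on values that are guaranteed castable.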

…ibute declared during fit to convert the fill values during transform