MNT Clean-up deprecations for 1.8: Imputer drops empty feature when keep_empty_features=False even if strategy='constant' #32266
…='constant' and keep_empty_features=False
CI failing here @FrancoisPgm
Yes, the failure is related to what I mentioned in the comment,
- # TODO(1.8): Remove FutureWarning and add `np.nan` as a statistic
- # for empty features to drop them later.
- if not self.keep_empty_features and ma.getmask(masked_X).all(axis=0).any():
-     warnings.warn(
-         "Currently, when `keep_empty_feature=False` and "
-         '`strategy="constant"`, empty features are not dropped. '
-         "This behaviour will change in version 1.8. Set "
-         "`keep_empty_feature=True` to preserve this behaviour.",
-         FutureWarning,
-     )

  # for constant strategy, self.statistcs_ is used to store
  # fill_value in each column
- return np.full(X.shape[1], fill_value, dtype=X.dtype)
+ statistics = np.full(X.shape[1], fill_value, dtype=X.dtype)
+
+ if not self.keep_empty_features:
+     for i in range(masked_X.shape[1]):
+         if ma.getmask(masked_X[:, i]).all():
+             statistics[i] = np.nan
+
+ return statistics
Here is the issue, where `np.nan` is added to `statistics`.
if not self.keep_empty_features:
    for i in range(missing_mask.shape[1]):
        if all(missing_mask[:, i].data):
            statistics[i] = np.nan
The reason this works, but the dense case doesn't, is that here `statistics` is created as a float array, whereas in the dense case the array is created with the same dtype as the input, so putting `np.nan` (which is a float) into an int array fails.
I think we should probably use an object dtype for `statistics`, and then convert the values to the right type just before writing them into X.
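The dtype mismatch described above can be reproduced with plain NumPy. The snippet below is only a minimal sketch: the array names (`X_int`, `stats_int`, `stats_obj`) and the fill value are illustrative assumptions, not code from this PR.

```python
import numpy as np

fill_value = 0
X_int = np.array([[1, 2], [3, 4]], dtype=np.int64)

# Dense path as in the diff: statistics inherits the integer dtype of X,
# so assigning np.nan raises because NaN has no integer representation.
stats_int = np.full(X_int.shape[1], fill_value, dtype=X_int.dtype)
try:
    stats_int[0] = np.nan
    int_assignment_failed = False
except ValueError:
    int_assignment_failed = True

# An object-dtype array can hold the integer fill value and np.nan together.
stats_obj = np.full(X_int.shape[1], fill_value, dtype=object)
stats_obj[0] = np.nan

# The non-NaN entries can still be cast back to the dtype of X when needed.
fill_for_kept = stats_obj[1:].astype(X_int.dtype)
```

With an object array, the cast back to the input dtype has to be deferred until the NaN entries (the columns to drop) have been filtered out.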
…ibute declared during fit to convert the fill values during transform
Reference Issues/PRs
Deprecation clean up for #29950.
What does this implement/fix? Explain your changes.
Remove the behaviour that made imputers keep empty features when `strategy='constant'`, even when `keep_empty_features` is set to `False`. Now `keep_empty_features=False` makes the imputer drop empty features in all cases.
Any other comments?
I followed the instructions left in the `TODO(1.8)` comments. In `SimpleImputer._dense_fit`, the comment says to put `np.nan` in the `statistic` entries for empty features so that they get dropped later. However, the `statistic` is a NumPy array with a `dtype` corresponding to X, and `np.nan` is a float, so it can't be inserted into int arrays. As a result, I currently have a test failing for integer arrays. I'm not sure about the best strategy to get around this issue.
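One possible way around this, following the object-dtype idea raised in the review, is sketched below. It is not the implementation in scikit-learn or in this PR: the functions `fit_constant` and `transform_constant` are hypothetical, and the example assumes missing values in an integer array are encoded with a sentinel (`-1` here).

```python
import numpy as np


def fit_constant(X, missing_values, fill_value, keep_empty_features=False):
    """Hypothetical helper: per-column fill values, NaN marks columns to drop."""
    mask = X == missing_values
    # Object dtype lets integer fill values and np.nan coexist.
    statistics = np.full(X.shape[1], fill_value, dtype=object)
    if not keep_empty_features:
        statistics[mask.all(axis=0)] = np.nan
    return statistics


def transform_constant(X, missing_values, statistics):
    """Hypothetical helper: drop NaN-marked columns, then fill in X's dtype."""
    mask = X == missing_values
    keep = np.array(
        [not (isinstance(s, float) and np.isnan(s)) for s in statistics]
    )
    X_out = X[:, keep].copy()
    # Convert the fill values only after the NaN entries are filtered out.
    fill = statistics[keep].astype(X.dtype)
    mask_kept = mask[:, keep]
    X_out[mask_kept] = np.broadcast_to(fill, X_out.shape)[mask_kept]
    return X_out


X = np.array([[-1, 2, -1], [-1, 5, 6]], dtype=np.int64)
stats = fit_constant(X, missing_values=-1, fill_value=0)
X_out = transform_constant(X, missing_values=-1, statistics=stats)

stats_keep = fit_constant(X, missing_values=-1, fill_value=0, keep_empty_features=True)
X_keep = transform_constant(X, missing_values=-1, statistics=stats_keep)
```

In this sketch the first column is entirely missing, so it is dropped when `keep_empty_features=False`, and the output keeps the integer dtype of the input because the cast happens after the NaN markers are removed.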