Closed
Description
The following code snippet leads to a surprising result:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
X, y = load_iris(return_X_y=True)
X[:, 0] = np.nan
imputer = SimpleImputer(keep_empty_features=False, strategy="constant", fill_value=1)
X_trans = imputer.fit_transform(X)
assert X_trans.shape[1] == 3, f"X_trans contains {X.shape[1]} columns"
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[19], line 11
8 imputer = SimpleImputer(keep_empty_features=False, strategy="constant", fill_value=1)
9 X_trans = imputer.fit_transform(X)
---> 11 assert X_trans.shape[1] == 3, f"X_trans contains {X.shape[1]} columns"
AssertionError: X_trans contains 4 columns
Apparently this is something that we really wanted for backward compatibility when merging #24770:
scikit-learn/sklearn/impute/tests/test_impute.py
Lines 1670 to 1692 in c91528c
@pytest.mark.parametrize("array_type", ["array", "sparse"])
@pytest.mark.parametrize("keep_empty_features", [True, False])
def test_simple_imputer_constant_keep_empty_features(array_type, keep_empty_features):
    """Check the behaviour of `keep_empty_features` with `strategy='constant'.
    For backward compatibility, a column full of missing values will always be
    fill and never dropped.
    """
    X = np.array([[np.nan, 2], [np.nan, 3], [np.nan, 6]])
    X = _convert_container(X, array_type)
    fill_value = 10
    imputer = SimpleImputer(
        strategy="constant",
        fill_value=fill_value,
        keep_empty_features=keep_empty_features,
    )

    for method in ["fit_transform", "transform"]:
        X_imputed = getattr(imputer, method)(X)
        assert X_imputed.shape == X.shape
        constant_feature = (
            X_imputed[:, 0].toarray() if array_type == "sparse" else X_imputed[:, 0]
        )
        assert_array_equal(constant_feature, fill_value)
Now, I'm wondering whether we should deprecate this behaviour, since the keep_empty_features parameter already lets users control whether an empty feature is dropped entirely.
So I would propose warning about a future change of behaviour when strategy="constant", keep_empty_features=False, and we detect one or more empty features.
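To make the proposal concrete, here is a minimal sketch of the detection and warning logic on a dense array. The helper name warn_on_kept_empty_features, the warning text, and the choice of FutureWarning are assumptions made for illustration; this is not code from SimpleImputer.

```python
import warnings

import numpy as np


def warn_on_kept_empty_features(X, strategy, keep_empty_features):
    """Hypothetical helper: warn when all-NaN columns would be kept.

    Returns the indices of the empty (all-NaN) columns of X.
    """
    X = np.asarray(X, dtype=float)
    # A feature is "empty" when every value in its column is missing.
    empty = np.flatnonzero(np.isnan(X).all(axis=0))
    if strategy == "constant" and not keep_empty_features and empty.size > 0:
        # Assumed wording; the real message would be decided in the PR.
        warnings.warn(
            f"Features {empty.tolist()} contain only missing values. "
            "They are currently filled with fill_value; a future version "
            "would drop them when keep_empty_features=False.",
            FutureWarning,
        )
    return empty


# Same toy data as in the test above: the first column is entirely NaN.
X = np.array([[np.nan, 2.0], [np.nan, 3.0], [np.nan, 6.0]])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    empty = warn_on_kept_empty_features(X, "constant", keep_empty_features=False)
```

With keep_empty_features=True the helper stays silent, since keeping the column is then exactly what the user asked for.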