SimpleImputer does not drop a column full of np.nan even when keep_empty_features=False #29827

@glemaitre

The following code snippet leads to a surprising result:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer

X, y = load_iris(return_X_y=True)
X[:, 0] = np.nan

imputer = SimpleImputer(keep_empty_features=False, strategy="constant", fill_value=1)
X_trans = imputer.fit_transform(X)

assert X_trans.shape[1] == 3, f"X_trans contains {X_trans.shape[1]} columns"
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[19], line 11
      8 imputer = SimpleImputer(keep_empty_features=False, strategy="constant", fill_value=1)
      9 X_trans = imputer.fit_transform(X)
---> 11 assert X_trans.shape[1] == 3, f"X_trans contains {X_trans.shape[1]} columns"

AssertionError: X_trans contains 4 columns

Apparently, this behaviour was kept on purpose for backward compatibility when merging #24770, as the following test shows:

@pytest.mark.parametrize("array_type", ["array", "sparse"])
@pytest.mark.parametrize("keep_empty_features", [True, False])
def test_simple_imputer_constant_keep_empty_features(array_type, keep_empty_features):
    """Check the behaviour of `keep_empty_features` with `strategy='constant'.
    For backward compatibility, a column full of missing values will always be
    fill and never dropped.
    """
    X = np.array([[np.nan, 2], [np.nan, 3], [np.nan, 6]])
    X = _convert_container(X, array_type)
    fill_value = 10
    imputer = SimpleImputer(
        strategy="constant",
        fill_value=fill_value,
        keep_empty_features=keep_empty_features,
    )
    for method in ["fit_transform", "transform"]:
        X_imputed = getattr(imputer, method)(X)
        assert X_imputed.shape == X.shape
        constant_feature = (
            X_imputed[:, 0].toarray() if array_type == "sparse" else X_imputed[:, 0]
        )
        assert_array_equal(constant_feature, fill_value)

Now, I'm wondering whether we should deprecate this behaviour, since the keep_empty_features parameter already controls whether an empty feature is dropped.

So I would propose to warn about the upcoming change of behaviour when strategy="constant", keep_empty_features=False, and we detect at least one empty feature.
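
The condition proposed above could be sketched as follows. This is only an illustration: `warn_on_empty_features` is a hypothetical helper, not part of scikit-learn, and both the choice of `FutureWarning` and the message text are assumptions. It handles dense float arrays with `np.nan` as the missing-value marker only.

```python
import warnings

import numpy as np


def warn_on_empty_features(X, strategy, keep_empty_features):
    """Hypothetical helper, not part of scikit-learn.

    Return a boolean mask of the columns that contain only NaN, and emit a
    FutureWarning when strategy="constant", keep_empty_features=False and at
    least one such column is found.
    """
    X = np.asarray(X, dtype=float)
    empty_mask = np.isnan(X).all(axis=0)
    if strategy == "constant" and not keep_empty_features and empty_mask.any():
        warnings.warn(
            "Empty features are currently filled when strategy='constant' even "
            "though keep_empty_features=False; a future version would drop them.",
            FutureWarning,
        )
    return empty_mask


X = np.array([[np.nan, 2.0], [np.nan, 3.0], [np.nan, 6.0]])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mask = warn_on_empty_features(X, strategy="constant", keep_empty_features=False)

print(mask.tolist(), [w.category.__name__ for w in caught])
```

The helper stays silent when `keep_empty_features=True` or when another strategy is used, so only the combination described in the proposal triggers the warning.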
