SimpleImputer crashes on constant imputation with a string value when the dataset is encoded numerically #12306

Closed
@janvanrijn

Description

The title kind of describes it. It might be pretty logical, but just putting it out here as it took a while for me to realize and debug what exactly happened.

The SimpleImputer can impute missing values with a constant. If the data is categorical, it is possible to impute with a string value. However, when fetching a dataset from OpenML (or from many other sources), categorical features are automatically encoded as numbers, so the resulting array has a numeric dtype. Applying the SimpleImputer with a string fill value to such data makes scikit-learn raise a ValueError. I assume there's not a lot that can be done about this, as everything behaves exactly as you would expect once you dive deep into the code, but maybe the documentation can be extended a little bit (probably on the SimpleImputer side, or maybe on the side of the data sources).
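
For illustration, here is a minimal sketch of the dtype dependence described above that does not need an OpenML download. The toy array is made up, and the exact wording of the error may differ between scikit-learn versions; the sketch only assumes that a string fill value is rejected for a float array and accepted for an object array.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one categorical feature stored as floats, with one NaN.
X_float = np.array([[0.0], [1.0], [np.nan], [1.0]])

imputer = SimpleImputer(strategy="constant", fill_value="missing")

# A string fill value on a numeric array is rejected at fit time.
try:
    imputer.fit(X_float)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# The same values as an object array are treated as non-numeric data,
# so the string fill value is accepted.
X_object = X_float.astype(object)
X_filled = imputer.fit_transform(X_object)
print(X_filled.ravel())
```

Note that the object array now mixes floats and a string, which later steps such as an encoder may not accept, so this is only meant to show where the check comes from.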

What do you think?

Steps/Code to Reproduce

import numpy as np
import sklearn.datasets
import sklearn.impute
import sklearn.pipeline
import sklearn.preprocessing

X, y = sklearn.datasets.fetch_openml('Australian', 4, return_X_y=True)

numeric_imputer = sklearn.impute.SimpleImputer(strategy='mean')
numeric_scaler = sklearn.preprocessing.StandardScaler()

nominal_imputer = sklearn.impute.SimpleImputer(strategy='constant', fill_value='missing')
nominal_encoder = sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore')

numeric_idx = [1, 2, 7, 10, 13]
nominal_idx = [0, 3, 4, 5, 6, 8, 9, 11, 12]

# Note: ~np.isnan(...) selects the non-NaN entries, so these are counts of present values.
print('missing numeric vals:', np.count_nonzero(~np.isnan(X[:, numeric_idx])))
print('missing nominal vals:', np.count_nonzero(~np.isnan(X[:, nominal_idx])))

clf_nom = sklearn.pipeline.make_pipeline(nominal_imputer, nominal_encoder)
clf_nom.fit(X[:, nominal_idx], y)

Expected Results

A fitted pipeline? Depending on how you write the documentation, the current error could also be the expected result.

Actual Results

missing numeric vals: 3450
missing nominal vals: 6210
Traceback (most recent call last):
  File "/home/janvanrijn/projects/sklearn-bot/testjan.py", line 23, in <module>
    clf_nom.fit(X[:, nominal_idx], y)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 265, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 230, in _fit
    **fit_params_steps[name])
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 329, in __call__
    return self.func(*args, **kwargs)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 614, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/base.py", line 465, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/impute.py", line 241, in fit
    "data".format(fill_value))
ValueError: 'fill_value'=missing is invalid. Expected a numerical value when imputing numerical data
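
One possible way around the error above, sketched under assumptions: keep the data numeric and impute with a numeric sentinel instead of a string. The array below is a made-up stand-in for the nominal columns, and -1 is an assumed sentinel that only works if it is not already used as a category code.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for numerically encoded nominal columns.
X_nominal = np.array([
    [0.0, 2.0],
    [1.0, np.nan],
    [np.nan, 2.0],
    [1.0, 3.0],
])

# -1 is an assumed sentinel; pick any number not used as a category code.
nominal_imputer = SimpleImputer(strategy="constant", fill_value=-1)
nominal_encoder = OneHotEncoder(handle_unknown="ignore")

# The imputer output stays numeric, so the encoder sees the sentinel
# as just one more category per column.
pipeline = make_pipeline(nominal_imputer, nominal_encoder)
X_encoded = pipeline.fit_transform(X_nominal)

print(X_encoded.shape)
print(pipeline.named_steps["onehotencoder"].categories_)
```

With this approach the missing values end up in their own one-hot column per feature, which is roughly what the string 'missing' was meant to achieve.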

Versions

Python=3.6.0
numpy==1.15.2
scikit-learn==0.20.0
scipy==1.1.0
