Description
The title mostly describes it. The behaviour may well be logical, but I am reporting it because it took me a while to realize and debug what exactly happened.
The SimpleImputer can impute missing values with a constant. If the data is categorical, that constant can be a string. However, when fetching a dataset from OpenML (or from many other sources), the categorical features are automatically encoded as numeric values. Applying the SimpleImputer with a string fill_value to such data raises a ValueError. I assume not much can be done about this, since everything behaves exactly as you would expect once you dive deep into the code, but maybe the documentation could be extended a little (probably on the SimpleImputer side, or maybe on the side of the data sources).
What do you think?
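To make the dtype dependence concrete, here is a minimal sketch on a toy array (not the OpenML data; the array and variable names are made up for illustration). It assumes a scikit-learn version that provides sklearn.impute.SimpleImputer, and it only shows the imputation step:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy stand-in for numerically encoded categorical data (hypothetical values).
X_num = np.array([[0.0, 1.0], [np.nan, 2.0], [1.0, np.nan]])

imputer = SimpleImputer(strategy='constant', fill_value='missing')

# A float array combined with a string fill_value is rejected at fit time.
try:
    imputer.fit_transform(X_num)
    raised = False
except ValueError as exc:
    raised = True
    print('ValueError:', exc)

# Casting to object dtype lets the same imputer accept the string constant.
X_obj = X_num.astype(object)
X_filled = imputer.fit_transform(X_obj)
print(X_filled)
```

Note that after the object cast the imputed columns mix floats and strings, so a downstream encoder may not accept them without a further conversion to a uniform type.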
Steps/Code to Reproduce
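import numpy as np
import sklearn.datasets
import sklearn.compose
import sklearn.tree
import sklearn.impute
import sklearn.pipeline
import sklearn.preprocessing

# Version 4 of the 'Australian' dataset; the features come back as a numeric array
X, y = sklearn.datasets.fetch_openml('Australian', version=4, return_X_y=True)

numeric_imputer = sklearn.impute.SimpleImputer(strategy='mean')
numeric_scaler = sklearn.preprocessing.StandardScaler()
# String fill_value intended for the categorical (nominal) columns
nominal_imputer = sklearn.impute.SimpleImputer(strategy='constant', fill_value='missing')
nominal_encoder = sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore')

numeric_idx = [1, 2, 7, 10, 13]
nominal_idx = [0, 3, 4, 5, 6, 8, 9, 11, 12]

# Note: ~np.isnan(...) selects the non-NaN entries, so despite the labels
# these two lines print the number of values that are present, not missing
print('missing numeric vals:', np.count_nonzero(~np.isnan(X[:, numeric_idx])))
print('missing nominal vals:', np.count_nonzero(~np.isnan(X[:, nominal_idx])))

# Fitting the nominal pipeline on the numerically encoded columns raises the error
clf_nom = sklearn.pipeline.make_pipeline(nominal_imputer, nominal_encoder)
clf_nom.fit(X[:, nominal_idx], y)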
Expected Results
A fitted pipeline? Depending on how the documentation is written, the current error could also be the expected result.
Actual Results
missing numeric vals: 3450
missing nominal vals: 6210
Traceback (most recent call last):
File "/home/janvanrijn/projects/sklearn-bot/testjan.py", line 23, in <module>
clf_nom.fit(X[:, nominal_idx], y)
File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 265, in fit
Xt, fit_params = self._fit(X, y, **fit_params)
File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 230, in _fit
**fit_params_steps[name])
File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 329, in __call__
return self.func(*args, **kwargs)
File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/pipeline.py", line 614, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/base.py", line 465, in fit_transform
return self.fit(X, y, **fit_params).transform(X)
File "/home/janvanrijn/anaconda3/envs/sklearn-bot/lib/python3.6/site-packages/sklearn/impute.py", line 241, in fit
"data".format(fill_value))
ValueError: 'fill_value'=missing is invalid. Expected a numerical value when imputing numerical data
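The error message asks for a numerical value, so one possible workaround (a sketch only, not necessarily the recommended approach) is to use a numeric sentinel as the constant and let the encoder treat it as its own category. The toy array and the sentinel -1 below are assumptions for illustration; the sentinel must not collide with a real category code in the data:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for numerically encoded nominal columns (hypothetical values).
X_nominal = np.array([[0.0, 1.0], [np.nan, 2.0], [1.0, np.nan]])

# -1 is an assumed sentinel that does not occur among the real category codes.
pipeline = make_pipeline(
    SimpleImputer(strategy='constant', fill_value=-1),
    OneHotEncoder(handle_unknown='ignore'),
)

# Each column ends up with three categories: two real codes plus the sentinel.
X_encoded = pipeline.fit_transform(X_nominal)
print(X_encoded.shape)
```

The trade-off is that the missing-value category is no longer self-describing in the way a string such as 'missing' would be.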
Versions
Python=3.6.0
numpy==1.15.2
scikit-learn==0.20.0
scipy==1.1.0