Description
Suppose you have a dataset with both categorical and numerical features, where the numerical features contain NaNs, and you want to one-hot encode only the categorical features (which contain no NaNs). This is currently not possible: OneHotEncoder raises a ValueError even when the NaNs are confined to columns that are not listed in categorical_features.
Steps/Code to Reproduce
import numpy as np
from sklearn.preprocessing import OneHotEncoder
X = np.array([[1,2,np.nan], [2,1,0]])
OneHotEncoder(categorical_features=[0]).fit_transform(X)
Expected Results
array([[ 1., 0., 2., nan],
[ 0., 1., 1., 0.]])
Actual Results
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\s166979\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 1902, in fit_transform
self.categorical_features, copy=True)
File "C:\Users\s166979\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 1697, in _transform_selected
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
File "C:\Users\s166979\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 407, in check_array
_assert_all_finite(array)
File "C:\Users\s166979\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 58, in _assert_all_finite
" or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Versions
Windows-10-10.0.14393-SP0
Python 3.6.0 |Anaconda 4.3.0 (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)]
NumPy 1.11.3
SciPy 0.18.1
Scikit-Learn 0.18.1
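Possible workaround
One way around this, sketched here on the assumption that the categorical columns themselves are NaN-free, is to encode only those columns and stack the untouched numerical columns back on afterwards. The names `categorical` and `numerical` are illustrative, and the `.toarray()` call assumes the encoder returns its default sparse output.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[1, 2, np.nan], [2, 1, 0]])

# Illustrative column split: only column 0 is categorical.
categorical = [0]
numerical = [1, 2]

# Encode only the NaN-free categorical columns, so the encoder's
# input validation never sees the NaNs in the numerical columns.
encoded = OneHotEncoder().fit_transform(X[:, categorical]).toarray()

# Append the numerical columns unchanged, NaNs included.
result = np.hstack([encoded, X[:, numerical]])
print(result)
```

For the example above this should give the array listed under Expected Results, with the encoded columns first and the numerical columns after them.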