OneHotEncoder should ignore NaNs outside categorical_features #8540

Closed
e-dorigatti opened this issue Mar 6, 2017 · 5 comments · Fixed by #9012

@e-dorigatti

Description

Suppose you have a dataset with both categorical and numerical features, where the numerical features contain NaNs but the categorical features do not, and you want to one-hot-encode the categorical features. This is not possible: OneHotEncoder raises a ValueError.

Steps/Code to Reproduce

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Column 0 is categorical; column 2 (numeric) contains a NaN.
X = np.array([[1, 2, np.nan], [2, 1, 0]])
OneHotEncoder(categorical_features=[0]).fit_transform(X)

Expected Results

array([[   1.,    0.,    2.,  nan],
       [   0.,    1.,    1.,    0.]])

Actual Results

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\s166979\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 1902, in fit_transform
    self.categorical_features, copy=True)
  File "C:\Users\s166979\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 1697, in _transform_selected
    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
  File "C:\Users\s166979\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 407, in check_array
    _assert_all_finite(array)
  File "C:\Users\s166979\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 58, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Versions

Windows-10-10.0.14393-SP0
Python 3.6.0 |Anaconda 4.3.0 (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)]
NumPy 1.11.3
SciPy 0.18.1
Scikit-Learn 0.18.1
@amueller
Member

amueller commented Mar 6, 2017

That would probably be good.
I think we should really think about how we want to approach a pipeline with imputation and one-hot encoding. It's a very common setting, but we don't really have a good solution for it; see also #6967 and #8539.

We probably want imputation to behave differently on continuous and categorical data.
I think imputing first and one-hot-encoding later is the way to go, which would make your issue somewhat of a non-issue [which we could still fix though].

Would that also be an acceptable solution for you? Or is there a particular reason you want to do one-hot encoding first and then imputation?
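
For illustration, a minimal sketch of the impute-first ordering, assuming scikit-learn >= 0.20 for SimpleImputer (at the time of this thread, the equivalent class was sklearn.preprocessing.Imputer):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

X = np.array([[1, 2, np.nan], [2, 1, 0]])

# Impute the NaN in the numeric column first (here with the column mean)...
X_imputed = SimpleImputer(strategy='mean').fit_transform(X)

# ...then one-hot-encode the categorical column. Recent OneHotEncoder
# versions have no categorical_features parameter, so column 0 is
# selected explicitly and stacked back with the numeric columns.
encoded = OneHotEncoder().fit_transform(X_imputed[:, [0]]).toarray()
result = np.hstack([encoded, X_imputed[:, 1:]])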

@e-dorigatti
Author

e-dorigatti commented Mar 6, 2017

The reason why I did one-hot encoding first is exactly because of #6967, and the fact that I had NaNs in both categorical and numeric features. It did not occur to me that I could first remove the NaNs in the categorical features, then apply the imputer, and finally the one-hot encoder. It's a bit convoluted, but it works.
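
Spelled out, that workaround might look like this (a hedged sketch; the column names are hypothetical, and SimpleImputer from scikit-learn >= 0.20 stands in for the Imputer class of the 0.18-era API):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: 'color' is categorical, 'age' is numeric, both contain NaNs.
df = pd.DataFrame({'color': ['red', None, 'blue'],
                   'age': [25.0, np.nan, 40.0]})

# 1. First remove the NaNs from the categorical feature, e.g. with a sentinel.
df['color'] = df['color'].fillna('missing')

# 2. Then apply the imputer to the numeric feature.
df[['age']] = SimpleImputer(strategy='mean').fit_transform(df[['age']])

# 3. Finally, one-hot-encode the now NaN-free categorical feature.
df = pd.get_dummies(df, columns=['color'])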

I might try to submit a pull request myself in the next few days, if you think this is easy enough for a beginner in scikit-learn.

@tobiaspiechowiak

Actually, var.dropna() on the pandas DataFrame worked fine for me as a way around the problem...
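
For reference, a minimal illustration of that workaround ('var' is a hypothetical DataFrame); note the caveat about dropping rows in the next comment:

import numpy as np
import pandas as pd

var = pd.DataFrame({'cat': ['a', 'b', 'a'], 'num': [1.0, np.nan, 3.0]})
var = var.dropna()  # drops every row that contains a NaN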

@fryasdf

fryasdf commented Aug 28, 2019

I do not understand: why has this been closed? I am using this in late summer of 2019 and the error still seems to be there. Also, please be careful: you should NOT simply dropna; that does not solve the problem, it just makes you blind to the symptoms. From a data science perspective it may be really harmful to remove NAs (because the absence of information might itself be important information for the model!).

Steps/Code to Reproduce

import math
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

one_hot_encoder = OneHotEncoder(handle_unknown='ignore')
df = pd.DataFrame({'boolean_as_string': ["True", "False", None]})
one_hot_encoder.fit(df)

df = pd.DataFrame({'numeric': [0.1, 0.2, None, math.nan, np.nan]})
one_hot_encoder.fit(df)

Expected Results
in the first case:
pd.DataFrame({'boolean_as_string_TRUE': [1, 0, 0], 'boolean_as_string_FALSE': [0, 1, 0]})
(the boolean column is one-hot encoded, and NA values result in all zeros)

and in the second case:
pd.DataFrame({'numeric': [0.1, 0.2, None, math.nan, np.nan]})
(the one-hot encoder should not touch the column, as it is already numeric)

Versions
Python 3.7.3 (default, Mar 27 2019, 16:54:48)
scikit-learn: 0.21.2-py37h27c97d8_0

@jnothman
Member

This is closed because ColumnTransformer can be used to fully disregard columns not being transformed by the OneHotEncoder.
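
A minimal sketch of that approach, assuming scikit-learn >= 0.20 (where ColumnTransformer was introduced) and the array from the original report:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = np.array([[1, 2, np.nan], [2, 1, 0]])

# Encode column 0 only; the remaining columns (NaNs included) are
# passed through untouched and never reach the encoder's validation.
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])],
                       remainder='passthrough',
                       sparse_threshold=0)  # force a dense output
ct.fit_transform(X)
# array([[ 1.,  0.,  2., nan],
#        [ 0.,  1.,  1.,  0.]])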
