Handle missing values in OneHotEncoder #11996
Comments
Hi @jnothman, can I jump on this, or do I need to wait for other core devs to share their opinions first? |
I think an initial implementation would be welcome. |
I'm trying to confirm I understand the task very well.
|
Perhaps: … A good idea might be to start by writing things other than the implementation: … |
You don't need a complete implementation to open a PR, either |
Is there any update on this issue - asking for min free in OneHotEncoder? Am I assuming correctly that we also need to change the …? |
Can this issue really be considered as easy? For me, this looks rather complex right now |
I'll give this a try. I'll make a PR when I have enough progress. |
Thanks @baluyotraf |
I'm not sure if having the row of NaNs is worth supporting. It seems to make this much trickier as well. Given my work on dabl, right now I'm more concerned with making things possible at all than making them very easy with sklearn. What I found most annoying within this complex of things (and it's only tangentially related but not sure which issue would be the correct one, #2888 maybe?) is that I can't actually use the "constant" strategy on the categorical columns within a ColumnTransformer. |
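The kind of pipeline discussed above can be sketched as follows. This is a hedged illustration (the column names `f1`/`f2` are made up for the example, not taken from anyone's code); in recent scikit-learn versions the `"constant"` strategy does work on object-dtype columns inside a `ColumnTransformer`:

```python
# Hedged sketch of the discussed pipeline: impute a constant placeholder
# on a string-valued categorical column inside a ColumnTransformer, then
# one-hot encode it. Column names are illustrative only.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"f1": ["a", np.nan, "b", np.nan],
                   "f2": [0.0, 1.0, np.nan, np.nan]})

cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
ct = ColumnTransformer([("cat", cat_pipe, ["f1"])], remainder="drop")
encoded = ct.fit_transform(df)   # learned categories: "a", "b", "missing"
print(encoded.shape)             # (4, 3)
```

Because the imputation happens before encoding, the placeholder string simply becomes a third category.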
I am also +1 for not supporting the option that would generate a row of NaNs; it sounds like YAGNI to me.

Let's consider the following case of a CSV file with 2 categorical columns, where one uses string labels and the other uses integer labels:

```python
>>> import pandas as pd
>>> from io import StringIO
>>> csv_content = """\
... f1,f2
... "a",0
... ,1
... "b",
... ,
... """
>>> raw_df = pd.read_csv(StringIO(csv_content))
>>> raw_df
    f1   f2
0    a  0.0
1  NaN  1.0
2    b  NaN
3  NaN  NaN
>>> raw_df.dtypes
f1     object
f2    float64
dtype: object
```

So by default pandas will use the float64 dtype for the int-valued column so as to be able to use NaN as the missing value marker. It's actually possible to use `SimpleImputer(strategy="constant")` on such data:

```python
>>> from sklearn.impute import SimpleImputer
>>> imputed = SimpleImputer(strategy="constant", fill_value="missing").fit_transform(raw_df)
>>> imputed
array([['a', 0.0],
       ['missing', 1.0],
       ['b', 'missing'],
       ['missing', 'missing']], dtype=object)
```

However, putting string values in an otherwise float-valued column is weird and causes the `OneHotEncoder` to crash on that column:

```python
>>> OneHotEncoder().fit_transform(imputed)
Traceback (most recent call last):
  File "<ipython-input-48-04b9d558c891>", line 1, in <module>
    OneHotEncoder().fit_transform(imputed)
  File "/home/ogrisel/code/scikit-learn/sklearn/preprocessing/_encoders.py", line 358, in fit_transform
    return super().fit_transform(X, y)
  File "/home/ogrisel/code/scikit-learn/sklearn/base.py", line 556, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/home/ogrisel/code/scikit-learn/sklearn/preprocessing/_encoders.py", line 338, in fit
    self._fit(X, handle_unknown=self.handle_unknown)
  File "/home/ogrisel/code/scikit-learn/sklearn/preprocessing/_encoders.py", line 86, in _fit
    cats = _encode(Xi)
  File "/home/ogrisel/code/scikit-learn/sklearn/preprocessing/label.py", line 114, in _encode
    raise TypeError("argument must be a string or number")
TypeError: argument must be a string or number
```

Using the debugger to see the underlying exception reveals:
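The underlying failure can be reproduced outside scikit-learn: the constant-imputed column mixes Python strings and floats in a single object array, and sorting such an array (which category extraction does internally) raises a `TypeError` on Python 3. A minimal reproduction, not scikit-learn internals:

```python
# Minimal reproduction: sorting an object array that mixes floats and
# strings fails on Python 3, which is what breaks category extraction
# on the constant-imputed column.
import numpy as np

mixed = np.array([0.0, 1.0, "missing", "missing"], dtype=object)
try:
    np.unique(mixed)  # np.unique sorts its input
except TypeError as exc:
    print("TypeError:", exc)
```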
One could use the column transformer to split the string-valued categorical columns from the number-valued categorical columns and use suitable … for each. However, from a usability standpoint it would make sense to have …. We could also implement the zero strategy with ….

We also need to make sure that NaN passed only at transform time (without being seen in this column at fit time) is accepted (with the zero encoding), so that cross-validation is possible on data with just a few missing values that might all end up in the validation split by chance. |
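The "zero strategy" mentioned above can be sketched in plain NumPy. This is an illustration under assumed semantics with made-up helper names, not a scikit-learn API: categories are learned from non-missing values only, and a NaN at transform time, even one never seen at fit time, encodes as an all-zeros row.

```python
# Hedged sketch of the proposed "zero" strategy (hypothetical helper
# names, not scikit-learn code): NaN is excluded from the learned
# categories, and a missing value encodes as an all-zeros row.
import numpy as np

def fit_categories(column):
    # learn categories from non-missing values only
    return sorted({v for v in column
                   if not (isinstance(v, float) and np.isnan(v))})

def transform_zero(column, categories):
    index = {c: j for j, c in enumerate(categories)}
    out = np.zeros((len(column), len(categories)))
    for i, v in enumerate(column):
        if v in index:              # NaN falls through: all-zeros row
            out[i, index[v]] = 1.0
    return out

cats = fit_categories(["a", np.nan, "b", np.nan])    # ["a", "b"]
encoded = transform_zero(["a", np.nan, "b", np.nan], cats)
print(encoded)
```

Because `transform_zero` never requires the missing marker to have been seen at fit time, cross-validation where all NaNs land in the validation split by chance would still work under these semantics.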
Note that some datasets, such as the Ames housing dataset from …. This also leads to a confusing error message …. |
take |
Any updates on this? I can't fit a dataset that contains 'nan' values. |
@netomenoci I recently worked on this and here is my comment on this issue: #16749 (comment) |
I am going to work on this with the goal of getting it into 0.24. |
FYI, all of the encoders in scikit-learn-contrib/category_encoders already have the option to …: https://github.com/scikit-learn-contrib/category_encoders/tree/master/category_encoders |
A minimum implementation might translate a NaN in input to a row of NaNs in output. I believe this would be the most consistent default behaviour with respect to other preprocessing tools, and with reasonable backwards compatibility, but other core devs might disagree (see #10465 (comment)).

NaN should also be excluded from the categories identified in `fit`.

A `handle_missing` parameter might allow NaN in input to be … in the output.

A `missing_values` parameter might allow the user to configure what object is a placeholder for missingness (e.g. NaN, None, etc.).

See #10465 for background.
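One behaviour discussed in this thread, treating NaN as an explicit indicator category, can already be previewed with pandas, whose `get_dummies` has a `dummy_na` option. This is shown purely as a comparison point, not as the proposed scikit-learn API:

```python
# pandas' get_dummies offers a NaN-indicator column via dummy_na; shown
# here as a comparison point for the behaviours under discussion.
import numpy as np
import pandas as pd

s = pd.Series(["a", np.nan, "b", np.nan], name="f1")
dummies = pd.get_dummies(s, dummy_na=True)  # columns: "a", "b", NaN
print(dummies)
```

The NaN indicator is appended as the last column, so the rows with missing `f1` get a 1 only there.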