Include drop='last' to OneHotEncoder #23436
Comments
There is a way to get this behaviour already, by passing the category to drop explicitly:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(['Male', 'Female', np.nan])
ohe = OneHotEncoder(drop=[np.nan])
ohe.fit_transform(df).toarray()
```

Output:

```
array([[0., 1.],
       [1., 0.],
       [0., 0.]])
```
I think the original request could be a convenience in some cases, but it's a bit weird to have a drop strategy that depends on the lexicographical ordering of the categories. For this particular case, I think @lesteve's solution is the right approach.

We could also offer to drop the most frequent category. Since the choice of the category to drop has an impact on the effect of regularization in a downstream linear model, dropping based on the frequency on the training set might be a good strategy to reduce collinearity for non-binary categorical variables in general. Maybe @lorentzenchr has an opinion on this.
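A frequency-based drop strategy is not built into `OneHotEncoder`, but it can be emulated today because the `drop` parameter accepts one explicit category per feature. A minimal sketch (the column name `sex` and the data are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"sex": ["Male", "Female", "Female", "Male", "Female"]})

# Find the most frequent category in each column of the training data...
most_frequent = [df[col].mode()[0] for col in df.columns]

# ...and pass it as the explicit per-feature category to drop.
ohe = OneHotEncoder(drop=most_frequent)
X = ohe.fit_transform(df).toarray()
```

Here `"Female"` is the modal category, so only the `"Male"` indicator column remains in the output.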
IIRC, R's formula interface drops the first level by default, which corresponds to our `drop='first'` option. More striking is the argument to have a frequency-based drop like `drop='most_frequent'`.

We could also consider supporting sample weights with "most_frequent", i.e. choosing the level with the highest sum of sample weights.

To the best of my knowledge, for linear models with penalties, one should never drop a level. For unpenalized linear models, the converse is true: always drop one level (otherwise one has perfect collinearity and solvers have a much harder job finding the unique minimum-norm solution).
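The sample-weighted variant is not sklearn API either, but the weighted counts are easy to compute by hand and fed to the same `drop` parameter. A hedged sketch with illustrative data and weights:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"sex": ["Male", "Female", "Male"]})
sample_weight = np.array([5.0, 1.0, 1.0])

# Sum the sample weights within each level; the level with the highest
# total weight is the one to drop.
weighted_counts = pd.Series(sample_weight).groupby(df["sex"]).sum()
to_drop = [weighted_counts.idxmax()]  # "Male": total weight 6.0 vs 1.0

ohe = OneHotEncoder(drop=to_drop)
X = ohe.fit_transform(df).toarray()
```

Note that plain frequency would give a tie here (counts 2 vs 1 are not tied, but with equal weights the choice could differ from the weighted one); the weighting is what makes `"Male"` the unambiguous level to drop.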
There is a (stalled) open PR for this.
I agree it is not worth it to have a `drop='last'` option.
Describe the workflow you want to enable
When using `SimpleImputer` + `OneHotEncoder`, I am able to add a new constant category for NaN values. However, I wanted an argument like `OneHotEncoder(drop='last')` so that this constant category is dropped from the output, which would allow all NaNs to be encoded as all zeros.
Describe your proposed solution
Describe alternatives you've considered, if relevant
There's no good alternative that stays compatible with sklearn's pipelines. I was following issue #11996, which proposed adding a `handle_missing` parameter to `OneHotEncoder`, but it has been set aside in favor of using a "constant" imputation strategy on the categorical columns. The constant strategy, however, adds an unnecessary new column that could be dropped in this scenario.
Additional context
No response