Include drop='last' to OneHotEncoder #23436


Closed
WittmannF opened this issue May 20, 2022 · 5 comments

@WittmannF

WittmannF commented May 20, 2022

Describe the workflow you want to enable

When using SimpleImputer + OneHotEncoder, I am able to add a new constant category for NaN values like the example below:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, [0])
    ])

df = pd.DataFrame(['Male', 'Female', np.nan])
preprocessor.fit_transform(df)

# array([[0., 1., 0.],
#       [1., 0., 0.],
#       [0., 0., 1.]])

However, I would like to have an argument like OneHotEncoder(drop='last') in order to get an output like:

array([[0., 1.],
       [1., 0.],
       [0., 0.]])

This would allow all NaNs to be encoded as all-zero rows.

Describe your proposed solution

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('encoder', OneHotEncoder(drop='last'))])

Describe alternatives you've considered, if relevant

There's no good alternative that stays compatible with sklearn's pipelines. I was following issue #11996 about adding a handle_missing option to OneHotEncoder, but it was set aside in favor of using a "constant" strategy on the categorical columns. However, the constant strategy adds an unnecessary new column that could be dropped in this scenario.

Additional context

No response

@WittmannF added the Needs Triage and New Feature labels on May 20, 2022
@lesteve
Member

lesteve commented May 25, 2022

There is a drop argument in OneHotEncoder to which you can pass an array (one category to drop for each feature); can you use this for your use case? Adapting your snippet, something like this:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(['Male', 'Female', np.nan])
ohe = OneHotEncoder(drop=[np.nan])
ohe.fit_transform(df).toarray()

Output:

array([[0., 1.],
       [1., 0.],
       [0., 0.]])
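
For reference, a minimal sketch (not from the thread) of how this drop-based workaround could slot into the original SimpleImputer pipeline; it assumes the default fill value 'missing_value' that SimpleImputer(strategy='constant') uses for object/string columns:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# After imputation the missing values become the constant category
# 'missing_value', so that is the category to drop.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('encoder', OneHotEncoder(drop=['missing_value']))])

preprocessor = ColumnTransformer(
    transformers=[('cat', categorical_transformer, [0])])

df = pd.DataFrame(['Male', 'Female', np.nan])
preprocessor.fit_transform(df)

# Expected output:
# array([[0., 1.],
#        [1., 0.],
#        [0., 0.]])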

@ogrisel
Member

ogrisel commented Jun 2, 2022

I think the original request could be a convenience in some cases but it's a bit weird to have a drop strategy that depends on the lexicographical ordering of the categories.

For this particular case, I think @lesteve's solution is the right approach.

We could also offer drop="most_frequent" (and maybe also "least_frequent", although the latter is probably very unstable under CV/resampling) to drop a category based on its frequency in the training set.

Since the choice of the category to drop has an impact on the effect of regularization in a downstream linear model, dropping based on the frequency in the training set might be a good strategy to reduce collinearity for non-binary categorical variables in general.

Maybe @lorentzenchr has an opinion on this.
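
Until such an option exists, a most-frequent drop can be approximated with the existing array form of drop by computing the per-feature mode on the training data; a rough sketch with made-up data (column names and values are illustrative only):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({
    'color': ['red', 'red', 'blue', 'green'],
    'size': ['S', 'M', 'M', 'M'],
})

# Most frequent category per column, computed on the training set.
most_frequent = [X_train[col].mode()[0] for col in X_train.columns]

ohe = OneHotEncoder(drop=most_frequent)
ohe.fit_transform(X_train).toarray()
ohe.drop_idx_  # indices of the dropped (most frequent) categories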

@thomasjpfan added the module:preprocessing and Needs Decision - Include Feature labels and removed the Needs Triage label on Jun 2, 2022
@lorentzenchr
Member

lorentzenchr commented Jun 5, 2022

IIRC, R's formula interface drops the first level by default. That corresponds to our option "first". Having "first", why not also have an option "last"? Edit: maybe not worth it.

More striking is the argument for having "most_frequent". This is a strategy I've seen in other GLM software. Model coefficients are then effects relative to this most frequent level, which seems the most natural choice (but is irrelevant for most other things).

We could also consider supporting sample weights with "most_frequent", i.e. choosing the level with the highest sum of sample weights.

To the best of my knowledge, for linear models with penalties, one should never drop a level. For unpenalized linear models, the converse is true: always drop one level (otherwise one has perfect collinearity and solvers have a much harder job finding the unique minimum norm solution).
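
As a small illustration of that advice (model and data choices here are made up, not from the thread), the two configurations would look roughly like:

import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'city': ['A', 'B', 'C', 'A']})
y = [1.0, 2.0, 3.0, 1.5]

# Penalized linear model: keep all levels, the penalty handles the collinearity.
penalized = make_pipeline(OneHotEncoder(), Ridge(alpha=1.0)).fit(X, y)

# Unpenalized linear model: drop one level per feature to avoid perfect collinearity.
unpenalized = make_pipeline(OneHotEncoder(drop='first'), LinearRegression()).fit(X, y)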

@ogrisel
Member

ogrisel commented Sep 9, 2022

There is a (stalled) open PR for drop="most_frequent" here: #18679.

@thomasjpfan
Member

I agree it is not worth it to have drop="last". The drop="most_frequent" feature is being tracked in #18553.
