What happened to the idea of adding a 'handle_missing' parameter to the OneHotEncoder? #26543

Open
woodly0 opened this issue Jun 8, 2023 Discussed in #26531 · 6 comments

@woodly0
woodly0 commented Jun 8, 2023

Originally posted by woodly0 June 7, 2023
Hello,
I'm having trouble understanding what finally happened to the idea of introducing a handle_missing parameter for the OneHotEncoder. My current project could still benefit from such an implementation.
There are many existing issues on this topic; however, I cannot deduce what was finally decided or implemented and what wasn't.

Considering the following features:

import pandas as pd

test_df = pd.DataFrame(
    {"col1": ["red", "blue", "blue"], "col2": ["car", None, "plane"]}
)

when using the encoder:

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(
    handle_unknown="ignore",
    sparse_output=False,
    #handle_missing="ignore"
)
ohe.fit_transform(test_df)

I get the output:

array([[0., 1., 1., 0., 0.],
       [1., 0., 0., 0., 1.],
       [1., 0., 0., 1., 0.]])

but what I'm actually looking for is to drop the None category, i.e. not create a new feature for it and instead leave all the other col2 features at zero for that row:

array([[0., 1., 1., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 1.]])

Is there a way to achieve this without using another transformer object?

github-actions bot added the "Needs Triage" label on Jun 8, 2023
@glemaitre
Member

I am not really keen on dropping missing values through an option. I would prefer explicit code over an option for this.

We currently can do that:

import sklearn
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.pipeline import make_pipeline

sklearn.set_config(transform_output="pandas")


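def _drop_None_cols(df):
    # Drop every one-hot column that was generated for the None category.
    # Note: this is a substring match, so a category whose name contains
    # "None" would be dropped as well.
    col_names = [col for col in df.columns if "None" in col]
    if col_names:
        return df.drop(columns=col_names)
    return df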


encoder_dropping_None = make_pipeline(
    OneHotEncoder(sparse_output=False),
    FunctionTransformer(_drop_None_cols),
)
encoder_dropping_None.fit_transform(test_df)

   col1_blue  col1_red  col2_car  col2_plane
0        0.0       1.0       1.0         0.0
1        1.0       0.0       0.0         0.0
2        1.0       0.0       0.0         1.0

@ogrisel
Member

ogrisel commented Jun 16, 2023

I think it would make sense to have:

  • handle_missing="encode_as_category" by default, with an option to set it to handle_missing="ignore",
  • missing_marker=(np.nan, pd.NA, None), which makes it possible to specify which values should be considered missing-value markers.

The missing_marker parameter would only be used if handle_missing="ignore", to avoid creating output columns for such values.

@woodly0
Author

woodly0 commented Jun 23, 2023

@glemaitre: Thanks for your answer. What you suggest is what I am actually doing already.
My question is about including this in the OneHotEncoder itself and thus avoiding the additional FunctionTransformer.

In order to avoid collinearity, I could use the parameter drop='first' but it seems more intuitive to drop the "_None" feature instead. Don't you agree? It is what the pandas.get_dummies() function does by default.
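For reference, the pandas behavior mentioned above can be checked with a minimal sketch on the same test_df from the original post. The dtype=float argument is only there to make the output comparable to the OneHotEncoder arrays; it is not part of the original discussion.

```python
import pandas as pd

test_df = pd.DataFrame(
    {"col1": ["red", "blue", "blue"], "col2": ["car", None, "plane"]}
)

# dummy_na=False is the default: no indicator column is created for the
# missing value, and the affected row is all zeros for the col2 features.
dummies = pd.get_dummies(test_df, dtype=float)
print(dummies)
```

Passing dummy_na=True instead would add an indicator column for missing values, which is closer to what OneHotEncoder currently does.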

@woodly0
Author

woodly0 commented Jun 29, 2023

@ogrisel: Sorry, I'm not sure I understand what you are suggesting.

@prathu138

prathu138 commented Aug 11, 2023

@woodly0: @ogrisel is suggesting a parameter for handling missing values (null, None, or pd.NA), together with a way to specify which values should be treated as missing.

@NTSER

NTSER commented May 4, 2025

I am willing to work on this. Based on @ogrisel's comment, I'm thinking of adding two parameters:

handle_missing : {'encode_as_category', 'ignore'}, default='encode_as_category'
    Specifies the strategy to handle missing values during encoding.

    - If "encode_as_category", missing values are treated as a separate category and encoded into distinct
      columns, regardless of the `missing_values` parameter.
    - If "ignore", missing values are excluded from the encoding process, and no column is created for them
      in the output.

missing_values : int, float, str, np.nan or None, default=np.nan
    The placeholder to identify missing values when `handle_missing="ignore"`. All values equal to
    `missing_values` will be considered missing and ignored during encoding.

    This parameter has no effect when `handle_missing="encode_as_category"`.

I chose the name missing_values for consistency with sklearn.impute, and the description aligns with their style.

I'll wait for feedback on this and will start working on it if it looks good to everyone.
