
KBinsDiscretizer: allow nans #9341

jnothman opened this issue Jul 12, 2017 · 15 comments · May be fixed by #17179

Comments

@jnothman
Member

jnothman commented Jul 12, 2017

Missing values, represented as NaN, could be treated as a separate category in discretization. This seems much more sensible to me than imputing the missing data then discretizing.

In accordance with recent changes to other preprocessing, NaNs would simply be ignored in calculating fit statistics, and would be passed on to the encoder in transform. I can't recall if we're handling this sensibly in OneHotEncoder yet...
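For illustration, a minimal sketch of the requested behaviour (hypothetical code, not the actual KBinsDiscretizer API): bin edges are computed while ignoring NaNs, and NaN values pass through transform untouched so a downstream encoder can treat them as their own category.

```python
import numpy as np

# Hypothetical sketch of the requested behaviour; not part of scikit-learn.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [10.0]])
n_bins = 3

# Quantile-style bin edges, ignoring NaNs in the fit statistics.
edges = np.nanpercentile(X[:, 0], np.linspace(0, 100, n_bins + 1))

codes = np.full(X.shape[0], np.nan)          # NaN rows stay NaN
finite = ~np.isnan(X[:, 0])
codes[finite] = np.clip(
    np.searchsorted(edges, X[finite, 0], side="right") - 1, 0, n_bins - 1)
print(codes)  # NaNs are passed on, e.g. to be encoded as a separate category
```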

@jnothman added the Easy, Enhancement, and Need Contributor labels Jul 12, 2017
@musiciancodes

I'll start working on this.

@jnothman
Member Author

jnothman commented Jul 16, 2017 via email

@hristog
Contributor

hristog commented Oct 7, 2017

Hi - is this being actively worked on? If not, I would like to pick it up.

@jnothman
Member Author

jnothman commented Oct 8, 2017

I think you are welcome to.

@jnothman changed the title from "discrete branch: allow nans?" to "KBinsDiscretizer: allow nans" Jul 12, 2018
@Framartin
Contributor

I would like to work on this issue.

But I believe that #11996 should be solved first, because KBinsDiscretizer makes use of OneHotEncoder. As suggested above by @jnothman, it would be simpler to let OneHotEncoder handle NaNs.
Can I open a new PR, marked as [WIP], to start working on it? Or should I wait for #11996 to be closed before opening a PR?

@jnothman
Member Author

jnothman commented Dec 24, 2018 via email

@jnothman
Member Author

jnothman commented Dec 24, 2018 via email

@Framartin
Contributor

Thanks a lot for pointing out #12045 (I hadn't noticed it). It will be useful for handling missing values when encode='ordinal'.

I will take a look at #11996 to see if I can take over the work on OneHotEncoder (to handle missing values when encode is 'onehot' or 'onehot-dense').

@SamDuan

SamDuan commented Nov 4, 2019

Just wondering if there is any update on this?
I was trying to use KBinsDiscretizer on data containing NaNs, with the NaNs placed into their own bucket.

@jnothman
Member Author

jnothman commented Nov 4, 2019

Nothing merged yet, unfortunately. Help with the open work is welcome, thanks @SamDuan

@SamDuan

SamDuan commented Nov 4, 2019

I see. Let me look into it.

@NicolasHug
Member

This is actually already implemented, but in the private ensemble/hist_gradient_boosting/binning API.

We could think of some ways to unify both classes.

@jnothman
Member Author

jnothman commented Nov 4, 2019 via email

@Guillermogsjc

Guillermogsjc commented Jan 10, 2020

Hi all,

the solution in my gist allows quantile transforming with NaNs. It is pandas-based, so it is not np.ndarray-friendly.

It is designed for pandas DataFrames, so some glue code would be needed for full sklearn compatibility (anyway, I am using it without any problem in my sklearn Pipeline).
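A rough sketch of what such a pandas-based approach might look like (hypothetical code; the gist itself is not reproduced here). pd.qcut already leaves NaN rows as NaN, so they can be given their own bucket afterwards:

```python
import numpy as np
import pandas as pd

# Hypothetical sketch of pandas-based quantile binning that keeps NaNs
# in their own bucket; this is not the gist referenced above.
s = pd.Series([1.0, 2.0, np.nan, 4.0, 10.0])

codes = pd.Series(pd.qcut(s, q=3, labels=False, duplicates="drop"), index=s.index)
codes = codes.fillna(-1).astype(int)  # give the NaN rows their own bucket
print(codes.tolist())                 # e.g. [0, 0, -1, 1, 2]
```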

@PabloRMira

Hi all,

I would like to help on this. I think a good strategy would be to set the NaN category to -1 in the ordinal encoding, which would then propagate naturally to the one-hot encoding.

What do you think about this?
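To illustrate the idea (a hypothetical sketch, not an actual patch to KBinsDiscretizer): NaN rows receive the ordinal code -1, and a downstream OneHotEncoder then simply treats -1 as one more category.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sketch of the -1 sentinel idea; not an actual patch.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [10.0]])
edges = np.nanpercentile(X[:, 0], [0, 33.3, 66.7, 100])

ordinal = np.full((X.shape[0], 1), -1.0)     # -1 marks the NaN "bin"
finite = ~np.isnan(X[:, 0])
ordinal[finite, 0] = np.clip(
    np.searchsorted(edges, X[finite, 0], side="right") - 1, 0, 2)

# -1 becomes just one more category in the one-hot encoding.
onehot = OneHotEncoder().fit_transform(ordinal).toarray()
print(onehot)
```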
