Discretizer #5778


Closed · mblondel opened this issue Nov 10, 2015 · 18 comments

@mblondel
Member

Binarizer transforms continuous values to two states (0 or 1). It would be nice to generalize this to an arbitrary number of states K.

This preprocessor would produce a scipy sparse matrix of shape (n_samples, K * n_features) using the one-of-K encoding. The bin edges could be chosen uniformly between the min and max of each feature, or from the quantiles of each feature.

For example, using uniformly chosen thresholds, if min=0, max=1.0 and K=3, a feature value between 0 and 0.33 would be encoded as [1, 0, 0], a value between 0.33 and 0.66 as [0, 1, 0] and a value between 0.66 and 1.0 as [0, 0, 1].
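As a rough illustration (not scikit-learn code), here is a numpy sketch of that behavior: uniform bin edges between each feature's min and max, bin assignment, then one-of-K encoding. The helper name uniform_one_hot is made up for this example.

import numpy as np
from scipy import sparse

def uniform_one_hot(X, K=3):
    X = np.asarray(X, dtype=float)
    columns = []
    for col in X.T:
        edges = np.linspace(col.min(), col.max(), K + 1)
        bins = np.digitize(col, edges[1:-1])   # bin index in [0, K - 1]
        columns.append(np.eye(K)[bins])        # one-of-K rows for this feature
    return sparse.csr_matrix(np.hstack(columns))  # shape (n_samples, K * n_features)

X = np.array([[0.10], [0.50], [0.90]])
print(uniform_one_hot(X, K=3).toarray())
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]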

My use case is that this encoding might be more meaningful than continuous values when using PolynomialFeatures.

Possibly related to #1062.

@jnothman
Member

jnothman commented Nov 10, 2015

See also #4468 and #4801, which offer an information-theoretic discretisation. Tree-based discretisation is also suggested there.

It's not clear why you want a one-hot matrix initially rather than ordinal features. "Unary" is another possible encoding.
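For concreteness, the same bin index could be represented under each scheme roughly as follows (a sketch; the "thermometer" convention shown for unary is one common choice, not a fixed API):

# Possible representations of bin index 2 with K = 4 bins
ordinal = 2                 # a single integer-valued feature
one_hot = [0, 0, 1, 0]      # one-of-K indicator vector
unary   = [1, 1, 1, 0]      # unary / "thermometer": ones up to and including the bin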

@mblondel
Member Author

I think the one-hot encoding makes sense for cross-product features. Ordinal features would be useful to remove noise but not for cross-product features.

@jnothman
Member

Yes, but we provide tools to transform ordinal features into one-hot.
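For example, ordinal bin indices can already be turned into a one-hot matrix with the existing OneHotEncoder (the input array here stands in for the output of a hypothetical discretizer):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

bins = np.array([[0], [1], [2], [1]])   # ordinal bin indices for one feature
one_hot = OneHotEncoder().fit_transform(bins)
print(one_hot.toarray())
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]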


@mblondel
Member Author

Good point. I think I could live with a tool that produces ordinal features then :)

@raghavrv
Member

Implementation at #4801

@mblondel
Member Author

#4801 is not an implementation of what I proposed. Maybe we could have a strategy option (uniform, quantile, mdlp) in a Discretizer class.
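A rough sketch of what such an estimator might look like (class name, parameters, and behaviour are hypothetical, not the implementation that was eventually merged):

import numpy as np

class Discretizer:
    """Hypothetical sketch: bin each feature into ordinal indices."""

    def __init__(self, n_bins=5, strategy="uniform"):
        self.n_bins = n_bins
        self.strategy = strategy  # "uniform" or "quantile"; "mdlp" would also need y

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        if self.strategy == "uniform":
            self.bin_edges_ = [np.linspace(col.min(), col.max(), self.n_bins + 1)
                               for col in X.T]
        else:  # "quantile"
            qs = np.linspace(0, 100, self.n_bins + 1)
            self.bin_edges_ = [np.percentile(col, qs) for col in X.T]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.column_stack([np.digitize(col, edges[1:-1])
                                for col, edges in zip(X.T, self.bin_edges_)])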

@jnothman
Member

MDLP is optionally supervised, iirc.


@hlin117
Contributor

hlin117 commented Nov 10, 2015

@jnothman : Yes, MDLP is supervised, based upon class labels.

I'd be willing to work on #4801 if there is more interest in it. I've become more familiar with the scikit-learn source code since I started the PR, and I'm willing to polish the code up.

@hlin117
Contributor

hlin117 commented Nov 10, 2015

I also opened up #5003 a long while ago.

@jnothman
Member

Thanks @hlin117. It seems the jury is still out on the advantages of MDLP over simpler methods, or methods reusing existing scikit-learn estimators like trees. As @amueller said there, it needs a motivating example to show that it is useful.

@hlin117
Contributor

hlin117 commented Nov 11, 2015

That's fine, I understand, @jnothman.

It seems that @mblondel's discretization description is very similar to R's cut function, which is part of R's standard library.

> vector <- c(0.05, 0.22, 0.33, 0.5, 0.65, 0.99, 0.87)
> cut(vector, breaks=c(0, 0.33, 0.66, 1))
[1] (0,0.33]    (0,0.33]    (0,0.33]    (0.33,0.66] (0.33,0.66] (0.66,1]
[7] (0.66,1]
Levels: (0,0.33] (0.33,0.66] (0.66,1]

The difference is that the user has to designate where the breaks are placed. If the user instead passes a single integer k as breaks, cut divides the range of the values into k intervals of equal width.
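For comparison, pandas offers a similar helper, pandas.cut, which also accepts either explicit breaks or an integer number of equal-width bins (output shown approximately; exact formatting varies by version):

import pandas as pd

vector = [0.05, 0.22, 0.33, 0.5, 0.65, 0.99, 0.87]
print(pd.cut(vector, bins=[0, 0.33, 0.66, 1]))
# [(0.0, 0.33], (0.0, 0.33], (0.0, 0.33], (0.33, 0.66], (0.33, 0.66], (0.66, 1.0], (0.66, 1.0]]
# Categories (3, ...): [(0.0, 0.33] < (0.33, 0.66] < (0.66, 1.0]]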

@hlin117
Contributor

hlin117 commented Nov 12, 2015

Is a PR welcome for this? We can discuss how we would like the class to be structured in the PR.

@mblondel
Member Author

@hlin117 That would be nice, thanks!

@hlin117
Contributor

hlin117 commented Nov 16, 2015

Thanks for the support, @mblondel. I'll work on this PR.

@mblondel
Member Author

I would start simple (only uniform binning) and add more strategies in other PRs.

hlin117 added commits to hlin117/scikit-learn that referenced this issue (Nov 16, 2015).
@hlin117
Contributor

hlin117 commented Nov 16, 2015

Please check the PR in #5825. Thanks!

hlin117 added further commits to hlin117/scikit-learn that referenced this issue (Nov 18, 2015 – Jun 3, 2017).
@jnothman
Member

jnothman commented Jul 12, 2017

A discretizer has been merged to a branch (https://github.com/scikit-learn/scikit-learn/tree/discrete). It should be merged into master once some remaining features and an example are added.

@jnothman
Member

Fixed in #9342
