Discretizer #5778


Closed · mblondel opened this issue Nov 10, 2015 · 18 comments

@mblondel
Member

Binarizer transforms continuous values to two states (0 or 1). It would be nice to generalize this to an arbitrary number of states K.

This preprocessor would produce a scipy sparse matrix of shape (n_samples, K * n_features) using the one-of-K encoding. The bin edges could be chosen uniformly between the min and max of each feature, or from the quantiles of each feature.

For example, using uniformly chosen thresholds, if min=0, max=1.0 and K=3, a feature value between 0 and 0.33 would be encoded as [1, 0, 0], a value between 0.33 and 0.66 as [0, 1, 0] and a value between 0.66 and 1.0 as [0, 0, 1].
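As a rough illustration (not scikit-learn code), here is a numpy sketch of that behavior: uniform bin edges between each feature's min and max, bin assignment, then one-of-K encoding. The helper name uniform_one_hot is made up for this example.

import numpy as np
from scipy import sparse

def uniform_one_hot(X, K=3):
    X = np.asarray(X, dtype=float)
    columns = []
    for col in X.T:
        edges = np.linspace(col.min(), col.max(), K + 1)
        bins = np.digitize(col, edges[1:-1])   # bin index in [0, K - 1]
        columns.append(np.eye(K)[bins])        # one-of-K rows for this feature
    return sparse.csr_matrix(np.hstack(columns))  # shape (n_samples, K * n_features)

X = np.array([[0.10], [0.50], [0.90]])
print(uniform_one_hot(X, K=3).toarray())
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]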

My use case is that this encoding might be more meaningful than continuous values when using PolynomialFeatures.

Possibly related to #1062.

@jnothman
Member

jnothman commented Nov 10, 2015

See also #4468 and #4801, which offer an information-theoretic discretisation. Tree-based discretisation is also suggested there.

It's not clear why you want a one-hot matrix initially rather than ordinal features. "Unary" is another possible encoding.
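For concreteness, the same bin index could be represented under each scheme roughly as follows (a sketch; the "thermometer" convention shown for unary is one common choice, not a fixed API):

# Possible representations of bin index 2 with K = 4 bins
ordinal = 2                 # a single integer-valued feature
one_hot = [0, 0, 1, 0]      # one-of-K indicator vector
unary   = [1, 1, 1, 0]      # unary / "thermometer": ones up to and including the bin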

@mblondel
Member Author

I think the one-hot encoding makes sense for cross-product features. Ordinal features would be useful to remove noise but not for cross-product features.

@jnothman
Member

Yes, but we provide tools to transform ordinal features into one-hot.
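For example, ordinal bin indices can already be turned into a one-hot matrix with the existing OneHotEncoder (the input array here stands in for the output of a hypothetical discretizer):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

bins = np.array([[0], [1], [2], [1]])   # ordinal bin indices for one feature
one_hot = OneHotEncoder().fit_transform(bins)
print(one_hot.toarray())
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]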


@mblondel
Member Author

Good point. I think I could live with a tool that produces ordinal features then :)

@raghavrv
Member

Implementation at #4801

@mblondel
Member Author

#4801 is not an implementation of what I proposed. Maybe we could have a strategy option (uniform, quantile, mdlp) in a Discretizer class.
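A rough sketch of what such an estimator might look like (class name, parameters, and behaviour are hypothetical, not the implementation that was eventually merged):

import numpy as np

class Discretizer:
    """Hypothetical sketch: bin each feature into ordinal indices."""

    def __init__(self, n_bins=5, strategy="uniform"):
        self.n_bins = n_bins
        self.strategy = strategy  # "uniform" or "quantile"; "mdlp" would also need y

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        if self.strategy == "uniform":
            self.bin_edges_ = [np.linspace(col.min(), col.max(), self.n_bins + 1)
                               for col in X.T]
        else:  # "quantile"
            qs = np.linspace(0, 100, self.n_bins + 1)
            self.bin_edges_ = [np.percentile(col, qs) for col in X.T]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.column_stack([np.digitize(col, edges[1:-1])
                                for col, edges in zip(X.T, self.bin_edges_)])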

@jnothman
Member

MDLP is optionally supervised, iirc.


@hlin117
Contributor

hlin117 commented Nov 10, 2015

@jnothman : Yes, MDLP is supervised, based upon class labels.

I'd be willing to work on #4801 if there is more interest in it. I've become more familiar with the scikit-learn source code since I started the PR, and I'm willing to polish the code up.

@hlin117
Contributor

hlin117 commented Nov 10, 2015

I also opened up #5003 a long while ago.

@jnothman
Member

Thanks @hlin117. It seems the jury is still out on the advantages of MDLP over simpler methods, or methods reusing existing scikit-learn estimators like trees. As @amueller said there, it needs a motivating example to show that it is useful.

@hlin117
Contributor

hlin117 commented Nov 11, 2015

That's fine, I understand, @jnothman.

It seems that @mblondel's discretization description is very similar to R's cut function, which is part of R's standard library.

> vector <- c(0.05, 0.22, 0.33, 0.5, 0.65, 0.99, 0.87)
> cut(vector, breaks=c(0, 0.33, 0.66, 1))
[1] (0,0.33]    (0,0.33]    (0,0.33]    (0.33,0.66] (0.33,0.66] (0.66,1]
[7] (0.66,1]
Levels: (0,0.33] (0.33,0.66] (0.66,1]

The difference is that the user has to designate where the breaks are placed. If the user instead passes a single integer k as breaks, cut divides the range of the values into k intervals of equal width.
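For comparison, pandas offers a similar helper, pandas.cut, which also accepts either explicit breaks or an integer number of equal-width bins (output shown approximately; exact formatting varies by version):

import pandas as pd

vector = [0.05, 0.22, 0.33, 0.5, 0.65, 0.99, 0.87]
print(pd.cut(vector, bins=[0, 0.33, 0.66, 1]))
# [(0.0, 0.33], (0.0, 0.33], (0.0, 0.33], (0.33, 0.66], (0.33, 0.66], (0.66, 1.0], (0.66, 1.0]]
# Categories (3, ...): [(0.0, 0.33] < (0.33, 0.66] < (0.66, 1.0]]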

@hlin117
Contributor

hlin117 commented Nov 12, 2015

Is a PR welcome for this? We can discuss how we would like the class to be structured in the PR.

@mblondel
Member Author

@hlin117 That would be nice, thanks!

@hlin117
Contributor

hlin117 commented Nov 16, 2015

Thanks for the support, @mblondel. I'll work on this PR.

@mblondel
Member Author

I would start simple (only uniform binning) and add more strategies in other PRs.

hlin117 added commits to hlin117/scikit-learn that referenced this issue (Nov 16, 2015).
@hlin117
Contributor

hlin117 commented Nov 16, 2015

Please check the PR in #5825. Thanks!

hlin117 added further commits to hlin117/scikit-learn that referenced this issue (Nov 18, 2015 – Jun 3, 2017).
@jnothman
Member

jnothman commented Jul 12, 2017

A discretizer has been merged to a branch (https://github.com/scikit-learn/scikit-learn/tree/discrete). It should be merged into master once some remaining features and an example are added.

@jnothman
Member

Fixed in #9342
