Discretization using Fayyad's MDLP stop criterion #4801
Conversation
Please add tests with examples from the paper, for instance, as well as tests for tricky cases. Your code incorporates syntax that is not valid in Python 3 (e.g. …).

return log(x, 2) if x > 0 else 0

class MDLP(object):
This is not in accordance with scikit-learn's naming scheme. From a glance at your code, it is applicable to multiclass classification problems only. Perhaps we should call this ClassificationDiscretizer or MDLDiscretizer.
Thanks for the review, @jnothman. I'll rewrite my code according to your comments.
Just note it was not a full review, rather some ideas to take into account.
A point to make is that the discretization of MDLP might not be the same as before, but the algorithm did not change. There is a caveat when attributes contain tied values, since the ordering of equal values can then affect the result.
Make it stable, at least, by doing a np.lexsort on x and y.
for Classification Learning"
"""

def __init__(self, **params):
Interesting that you've interpreted the contributors' guide this way... All our estimators have an explicit list of parameters, not ** magic. With an explicit list of parameters, the implementation of get_params and set_params in BaseEstimator will suffice.
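For illustration, a minimal sketch of that convention; the class name and parameters here are hypothetical, not the ones in this PR:

```python
# Sketch only: an estimator with an explicit parameter list. Because each
# argument is stored under its own name, BaseEstimator's get_params and
# set_params work without any ** magic.
from sklearn.base import BaseEstimator, TransformerMixin

class MDLPDiscretizer(BaseEstimator, TransformerMixin):
    def __init__(self, continuous_features=None, min_depth=0):
        self.continuous_features = continuous_features
        self.min_depth = min_depth

est = MDLPDiscretizer(min_depth=2)
print(est.get_params())  # {'continuous_features': None, 'min_depth': 2}
```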
Thanks for the tip. I think I got confused by the documentation of the OneHotEncoder here, whose set_params() function takes **. But now that I know that the OneHotEncoder just inherits that function from its base class, which has the ** parameters, this makes sense to me.
PS: still really looking for tests.
@jnothman: I'll be happy to provide more tests. Where should I commit these tests, though? Should I commit them as GitHub gists?
Tests are coded in separate modules. See sklearn/preprocessing/tests.
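A rough sketch of what such a test module could look like; the estimator name, import path, and transform output convention are assumptions, and the real tests should use examples from the paper:

```python
# Hypothetical test module, e.g. sklearn/preprocessing/tests/test_mdlp.py
import numpy as np
from numpy.testing import assert_array_equal

from sklearn.preprocessing import MDLPDiscretizer  # assumed name and location


def test_two_well_separated_classes_get_one_cut():
    # With this much separation the MDLP criterion accepts a single cut
    # between 3.0 and 10.0, so the column maps onto two interval codes.
    X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
    y = np.array([0, 0, 0, 1, 1, 1])
    est = MDLPDiscretizer().fit(X, y)
    assert_array_equal(est.transform(X).ravel(), [0, 0, 0, 1, 1, 1])
```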
Also, a usage example would be great. Your example shows just what happens, but not why it is useful. It would be great to have a data set and task where using this method actually improves upon the numeric features. Why not just put them in a forest?
"It would be great to have a data set and task where using this method actually improves upon the numeric features. Why not just put them in a forest?"

Yes. This is what I have been wondering since the beginning of this PR.
I can imagine there are use cases where you want to train a linear model for speed reasons, or because trees don't work well on some variables. But I need to see an application / comparison to believe it is useful.
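A comparison along those lines could be set up roughly as below; MDLPDiscretizer stands in for the estimator proposed in this PR, and no result is claimed here:

```python
# Sketch of a comparison: logistic regression on MDLP-binned (one-hot) features
# versus a random forest on the raw numeric features.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X, y = load_breast_cancer(return_X_y=True)

binned_linear = make_pipeline(
    MDLPDiscretizer(),                      # hypothetical estimator from this PR
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print(cross_val_score(binned_linear, X, y, cv=5).mean())
print(cross_val_score(forest, X, y, cv=5).mean())
```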
+1. I suspect uniform binning and quantile-based binning should work well enough in many applications. Also, the lack of sparse data support is problematic.
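For reference, the two unsupervised baselines mentioned here take only a few lines of NumPy (a sketch, unrelated to the code in this PR):

```python
# Uniform-width versus quantile (equal-frequency) binning of one feature.
import numpy as np

x = np.random.RandomState(0).exponential(size=1000)
n_bins = 5

uniform_edges = np.linspace(x.min(), x.max(), n_bins + 1)[1:-1]
quantile_edges = np.percentile(x, np.linspace(0, 100, n_bins + 1)[1:-1])

uniform_codes = np.digitize(x, uniform_edges)    # equal-width intervals
quantile_codes = np.digitize(x, quantile_edges)  # roughly equal-sized bins
```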
currlevel = search_intervals.back()
search_intervals.pop_back()
start, end, depth = unwrap(currlevel)
PyMem_Free(currlevel)
We try to avoid manual memory management in the project.
I'm only a beginner when it comes to Cython. Does Cython have a garbage collector, or will this cause a memory leak?
I can definitely work on this PR more if there is more interest in it. I've gotten more familiar with the scikit-learn source code since then, and I'm willing to polish this code up.
I just wanted to add that a very useful application of supervised discretization is scorecard modeling in heavily regulated industries like credit decisioning. Uniform binning and quantile binning are usually not sufficient for zero-inflated data, for example. This technique won't help win Kaggle competitions, but it is very useful when transparency is more important than predictive performance.
Closing this for now. For a related project, see https://github.com/hlin117/mdlp-discretization. Pull requests to that project are welcome!
This pull request addresses #4468
This adds the feature of discretization using Fayyad's minimum description length principle (MDLP) stop criterion. The original paper describing the principle is here. Essentially, it splits each continuous attribute into intervals by recursively choosing cut points that minimize the conditional entropy of the class labels given the discretized attribute, and it stops splitting when the MDLP criterion is no longer satisfied.
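For readers unfamiliar with the criterion, the acceptance test from Fayyad and Irani's paper can be written down compactly. The sketch below only spells out the formula; it does not mirror the Cython implementation in this PR:

```python
# MDLP stop criterion: a candidate cut is accepted only if its information
# gain exceeds (log2(N - 1) + delta) / N (Fayyad & Irani, 1993).
import numpy as np


def entropy(y):
    # Shannon entropy (base 2) of the class labels in y.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


def mdlp_accepts(y, y_left, y_right):
    n = len(y)
    k, k1, k2 = (len(np.unique(v)) for v in (y, y_left, y_right))
    gain = entropy(y) - (len(y_left) * entropy(y_left)
                         + len(y_right) * entropy(y_right)) / n
    delta = np.log2(3 ** k - 2) - (k * entropy(y)
                                   - k1 * entropy(y_left)
                                   - k2 * entropy(y_right))
    return gain > (np.log2(n - 1) + delta) / n
```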
I demonstrate how to use this feature in this gist. I also show that it produces approximately the same output as the corresponding R package, "discretization". I say "approximately" because there are some rows where the output differs; I looked into this and suspect the R package has roundoff errors. Also note that this feature allows the user to specify which columns to discretize, whereas the R package assumes every column is continuous.
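To make the intended usage concrete, the gist does roughly the following; the estimator name and the parameter for selecting columns are assumptions here, not the final API:

```python
# Usage sketch: discretize the continuous columns of the iris data, supervised
# by the class labels.
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# 'continuous_features' is a hypothetical name for choosing which columns to
# discretize; the R 'discretization' package instead treats every column as
# continuous.
est = MDLPDiscretizer(continuous_features=[0, 1, 2, 3])
X_binned = est.fit_transform(X, y)  # each selected column becomes interval codes
```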