
Discretization using Fayyad's MDLP stop criterion #4801


Closed
hlin117 wants to merge 22 commits

Conversation

hlin117 (Contributor) commented Jun 2, 2015

This pull request addresses #4468

This adds the feature of discretization using Fayyad's minimum description length principle (MDLP) stop criterion. The original paper describing the principle is here. Essentially, it splits the continuous attributes into intervals by minimizing the conditional entropy between the attribute in question and the class values.

I demonstrate how to use this feature in this gist. I also show that it produces "approximately" the same output as the corresponding R package, "discretization". I say "approximately" because there are some rows where the output differs; I looked into the discrepancy, and my assumption is that the R package has round-off errors. Also note that this feature lets users specify which columns to discretize, whereas the R package assumes every column is continuous.
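For readers unfamiliar with the criterion, here is a minimal NumPy sketch of the accept/reject test from Fayyad & Irani's paper (illustrative only, not the code in this PR); the discretizer applies this test recursively within each interval it accepts:

```python
from __future__ import division  # safe on both Python 2 and 3

import numpy as np

def entropy(y):
    """Shannon entropy (base 2) of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mdlp_accepts_cut(y_left, y_right):
    """Fayyad & Irani's MDLP stop criterion for one candidate cut point.

    y_left / y_right are the class labels on either side of the cut after
    sorting the samples by the continuous attribute.
    """
    y = np.concatenate([y_left, y_right])
    n = len(y)
    ent, ent1, ent2 = entropy(y), entropy(y_left), entropy(y_right)
    gain = ent - (len(y_left) / n) * ent1 - (len(y_right) / n) * ent2
    k, k1, k2 = (len(np.unique(v)) for v in (y, y_left, y_right))
    delta = np.log2(3 ** k - 2) - (k * ent - k1 * ent1 - k2 * ent2)
    # Accept the cut only if the information gain pays for the extra
    # description length of encoding the partition.
    return gain > (np.log2(n - 1) + delta) / n
```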

jnothman (Member) commented Jun 2, 2015

Please add tests, for instance with examples from the paper, as well as tests for tricky cases. Your code uses syntax that is not valid in Python 3 (e.g. lambda (x, y): ...) and syntax that is not valid in Python 2.6 (dict comprehensions). But I suspect almost your entire contribution will be rewritten before merging in order to harness numpy. However, tests can always be implemented, regardless of the internal code structure.
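For concreteness, the two portability issues mentioned above and cross-version replacements (illustrative only):

```python
# Tuple-unpacking lambdas were removed in Python 3:
#     sorted(pairs, key=lambda (x, y): y)    # SyntaxError on Python 3
pairs = [("a", 2), ("b", 1)]
print(sorted(pairs, key=lambda pair: pair[1]))   # works on Python 2 and 3

# Dict comprehensions only exist from Python 2.7 onwards:
#     counts = {label: n for label, n in pairs}  # SyntaxError on Python 2.6
counts = dict((label, n) for label, n in pairs)  # works on Python 2.6+
```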

return log(x, 2) if x > 0 else 0


class MDLP(object):
Member

This is not in accordance with scikit-learn's naming scheme. From a glance at your code, it is applicable to multiclass classification problems only. Perhaps we should call this ClassificationDiscretizer or MDLDiscretizer.

hlin117 (Contributor Author) commented Jun 2, 2015

Thanks for the review, @jnothman. I'll rewrite my code according to your comments.

jnothman (Member) commented Jun 2, 2015

Just note it was not a full review, rather some ideas to take into consideration.


hlin117 (Contributor Author) commented Jun 2, 2015

I updated the code according to @jnothman's comments above. I also tried to follow the scikit-learn contribution guidelines here more closely.

hlin117 (Contributor Author) commented Jun 2, 2015

One point to note: the MDLP discretization output might not be exactly the same as before, even though the algorithm itself did not change. The caveat is that when attribute values x are repeated but their class labels y differ, the algorithm sorts only on x, so the order of the tied samples is arbitrary and the entropy at the candidate cut points may vary.

jnothman (Member) commented Jun 2, 2015

Make it stable, at least, by doing an np.lexsort on x and y.

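A small illustration of the suggestion, assuming x holds the attribute values and y the class labels:

```python
import numpy as np

x = np.array([0.3, 0.1, 0.3, 0.2])
y = np.array([1, 0, 0, 1])

# np.lexsort sorts by the last key first, so this orders the samples by x and
# breaks ties in x deterministically by y, making the cut-point search stable.
order = np.lexsort((y, x))
x_sorted, y_sorted = x[order], y[order]
# x_sorted -> [0.1, 0.2, 0.3, 0.3]; the tied 0.3 values are ordered by y: [0, 1]
```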

for Classification Learning"
"""

def __init__(self, **params):
Member

Interesting that you've interpreted the contributors' guide this way... All our estimators have an explicit list of parameters, not ** magic. With an explicit list of parameters, the implementation of get_params and set_params in BaseEstimator will suffice.
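For illustration, the explicit-parameter style referred to above looks roughly like this (the class name follows the earlier naming suggestion; the parameter names are placeholders, not this PR's final API):

```python
from sklearn.base import BaseEstimator, TransformerMixin

class MDLDiscretizer(BaseEstimator, TransformerMixin):
    # Every constructor argument is listed explicitly and only stored, so
    # BaseEstimator.get_params / set_params work without any ** magic.
    def __init__(self, continuous_features=None, min_depth=0):
        self.continuous_features = continuous_features
        self.min_depth = min_depth
```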

Contributor Author

Thanks for the tip. I think I got confused by the documentation of the OneHotEncoder here, whose set_params() function takes **. But now that I know that the OneHotEncoder just inherits that function from its base class, which has the ** parameters, this makes sense to me.

jnothman (Member) commented Jun 3, 2015

PS: still really looking for tests

hlin117 (Contributor Author) commented Jun 3, 2015

@jnothman: I'll be happy to provide more tests. Where should I commit these tests though? Should I commit them as github gists?

jnothman (Member) commented Jun 3, 2015

Tests are coded in separate modules. See sklearn/preprocessing/tests


amueller (Member) commented Jun 3, 2015

Also, a usage example would be great. Your example just shows what happens, but not why it is useful. It would be great to have a data set and task where using this method actually improves upon using the raw numeric features. Why not just put them in a forest?

GaelVaroquaux (Member) commented Jun 3, 2015 via email

amueller (Member) commented Jun 3, 2015

I can imagine there are use-cases where you want to train a linear model for speed reasons or because trees don't work on some variables or something. But I need to see an application / comparison to believe it is useful.

raghavrv mentioned this pull request Nov 10, 2015
mblondel (Member) commented

> I suspect more useful discretizers/binning to include may be those that are substantially less complex (e.g. equal-width or equal-population buckets).

+1

I suspect uniform binning and quantile-based binning should work well enough in many applications. I created #5778 to track this issue.

Also, lack of sparse data support is problematic.
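For concreteness, the two simpler schemes mentioned above can be sketched in a few lines of NumPy (illustrative only, not a proposed API):

```python
import numpy as np

x = np.array([0.1, 0.4, 1.2, 3.5, 3.6, 7.9, 8.0, 9.9])
n_bins = 4

# Equal-width (uniform) binning: edges evenly spaced over the value range.
width_edges = np.linspace(x.min(), x.max(), n_bins + 1)
width_codes = np.digitize(x, width_edges[1:-1])

# Quantile (equal-population) binning: roughly the same number of samples per bin.
quantile_edges = np.percentile(x, np.linspace(0, 100, n_bins + 1))
quantile_codes = np.digitize(x, quantile_edges[1:-1])
```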

currlevel = search_intervals.back()
search_intervals.pop_back()
start, end, depth = unwrap(currlevel)
PyMem_Free(currlevel)
Member

We try to avoid manual memory management in the project.

Contributor Author

I'm only a beginner when it comes to Cython. Does Cython have a garbage collector, or will this cause a memory leak?
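For illustration, a minimal sketch of an interval stack built from plain Python tuples: tuples are reference-counted by CPython and freed automatically, so no PyMem_Free is needed (the split rule below is a placeholder, not the algorithm in this PR):

```python
def walk_intervals(n_samples, max_depth=3):
    # Stack of (start, end, depth) tuples instead of manually allocated structs.
    search_intervals = [(0, n_samples, 0)]
    visited = []
    while search_intervals:
        start, end, depth = search_intervals.pop()
        visited.append((start, end, depth))
        mid = (start + end) // 2
        # Placeholder split rule; the real code would apply the MDLP test here.
        if end - start > 1 and depth < max_depth:
            search_intervals.append((start, mid, depth + 1))
            search_intervals.append((mid, end, depth + 1))
    return visited
```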

hlin117 (Contributor Author) commented Nov 10, 2015

I can definitely work on this PR more if there is more interest in it. I've gotten more familiar with the scikit-learn source code since then, and I'm willing to polish this code up.

gravesee commented

I just wanted to add that a very useful application of supervised discretization is scorecard modeling in a heavily regulated industry like credit decisioning. Uniform binning and quantile binning are usually not sufficient for zero-inflated data, for example.

This technique won't help win Kaggle competitions, but it is very useful when transparency is more important than predictive performance.

hlin117 (Contributor Author) commented Feb 20, 2017

Closing this for now. If you would like to see a related project, see https://github.com/hlin117/mdlp-discretization. Pull requests to this project are welcome!
