Discretization using Fayyad's MDLP stop criterion #4801
Conversation
Please add tests with examples from the paper, for instance, as well as tests for tricky cases. Your code incorporates syntax that is not valid in Python 3 (e.g. …).

return log(x, 2) if x > 0 else 0

class MDLP(object):
This is not in accordance with scikit-learn's naming scheme. From a glance at your code, it is applicable to multiclass classification problems only. Perhaps we should call this ClassificationDiscretizer or MDLDiscretizer.
Thanks for the review, @jnothman. I'll rewrite my code according to your comments.
Just note it was not a full review, rather some ideas to take into account.
A point to make is that the discretization of MDLP might not be the same as before, but the algorithm did not change. There is a caveat when attributes contain tied values, since the ordering of equal values can then affect the result.
Make it stable, at least, by doing a np.lexsort on x and y.
for Classification Learning"
"""

def __init__(self, **params):
Interesting that you've interpreted the contributors' guide this way... All our estimators have an explicit list of parameters, not ** magic. With an explicit list of parameters, the implementation of get_params and set_params in BaseEstimator will suffice.
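For illustration, a minimal sketch of that convention; the class name and parameters here are hypothetical, not the ones in this PR:

```python
# Sketch only: an estimator with an explicit parameter list. Because each
# argument is stored under its own name, BaseEstimator's get_params and
# set_params work without any ** magic.
from sklearn.base import BaseEstimator, TransformerMixin

class MDLPDiscretizer(BaseEstimator, TransformerMixin):
    def __init__(self, continuous_features=None, min_depth=0):
        self.continuous_features = continuous_features
        self.min_depth = min_depth

est = MDLPDiscretizer(min_depth=2)
print(est.get_params())  # {'continuous_features': None, 'min_depth': 2}
```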
Thanks for the tip. I think I got confused by the documentation of the OneHotEncoder here, whose set_params() function takes **. But now that I know that the OneHotEncoder just inherits that function from its base class, which has the ** parameters, this makes sense to me.
PS: still really looking for tests.
@jnothman: I'll be happy to provide more tests. Where should I commit these tests, though? Should I commit them as GitHub gists?
Tests are coded in separate modules. See sklearn/preprocessing/tests.
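A rough sketch of what such a test module could look like; the estimator name, import path, and transform output convention are assumptions, and the real tests should use examples from the paper:

```python
# Hypothetical test module, e.g. sklearn/preprocessing/tests/test_mdlp.py
import numpy as np
from numpy.testing import assert_array_equal

from sklearn.preprocessing import MDLPDiscretizer  # assumed name and location


def test_two_well_separated_classes_get_one_cut():
    # With this much separation the MDLP criterion accepts a single cut
    # between 3.0 and 10.0, so the column maps onto two interval codes.
    X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
    y = np.array([0, 0, 0, 1, 1, 1])
    est = MDLPDiscretizer().fit(X, y)
    assert_array_equal(est.transform(X).ravel(), [0, 0, 0, 1, 1, 1])
```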
Also, a usage example would be great. Your example shows just what happens, but not why it is useful. It would be great to have a data set and task where using this method actually improves upon the numeric features. Why not just put them in a forest?
"It would be great to have a data set and task where using this method actually improves upon the numeric features. Why not just put them in a forest?"

Yes. This is what I have been wondering since the beginning of this PR.
I can imagine there are use cases where you want to train a linear model for speed reasons, or because trees don't work well on some variables. But I need to see an application / comparison to believe it is useful.
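A comparison along those lines could be set up roughly as below; MDLPDiscretizer stands in for the estimator proposed in this PR, and no result is claimed here:

```python
# Sketch of a comparison: logistic regression on MDLP-binned (one-hot) features
# versus a random forest on the raw numeric features.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X, y = load_breast_cancer(return_X_y=True)

binned_linear = make_pipeline(
    MDLPDiscretizer(),                      # hypothetical estimator from this PR
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print(cross_val_score(binned_linear, X, y, cv=5).mean())
print(cross_val_score(forest, X, y, cv=5).mean())
```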
+1. I suspect uniform binning and quantile-based binning should work well enough in many applications. Also, the lack of sparse data support is problematic.
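For reference, the two unsupervised baselines mentioned here take only a few lines of NumPy (a sketch, unrelated to the code in this PR):

```python
# Uniform-width versus quantile (equal-frequency) binning of one feature.
import numpy as np

x = np.random.RandomState(0).exponential(size=1000)
n_bins = 5

uniform_edges = np.linspace(x.min(), x.max(), n_bins + 1)[1:-1]
quantile_edges = np.percentile(x, np.linspace(0, 100, n_bins + 1)[1:-1])

uniform_codes = np.digitize(x, uniform_edges)    # equal-width intervals
quantile_codes = np.digitize(x, quantile_edges)  # roughly equal-sized bins
```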
currlevel = search_intervals.back()
search_intervals.pop_back()
start, end, depth = unwrap(currlevel)
PyMem_Free(currlevel)
We try to avoid manual memory management in the project.
I'm only a beginner when it comes to Cython. Does Cython have a garbage collector, or will this cause a memory leak?
I can definitely work on this PR more if there is more interest in it. I've gotten more familiar with the scikit-learn source code since then, and I'm willing to polish this code up.
I just wanted to add that a very useful application of supervised discretization is scorecard modeling in heavily regulated industries like credit decisioning. Uniform binning and quantile binning are usually not sufficient for zero-inflated data, for example. This technique won't help win Kaggle competitions, but it is very useful when transparency is more important than predictive performance.
Closing this for now. For a related project, see https://github.com/hlin117/mdlp-discretization. Pull requests to that project are welcome!
This pull request addresses #4468
This adds the feature of discretization using Fayyad's minimum description length principle (MDLP) stop criterion. The original paper describing the principle is here. Essentially, it splits each continuous attribute into intervals by recursively choosing cut points that minimize the conditional entropy of the class labels given the discretized attribute, and it stops splitting when the MDLP criterion is no longer satisfied.
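For readers unfamiliar with the criterion, the acceptance test from Fayyad and Irani's paper can be written down compactly. The sketch below only spells out the formula; it does not mirror the Cython implementation in this PR:

```python
# MDLP stop criterion: a candidate cut is accepted only if its information
# gain exceeds (log2(N - 1) + delta) / N (Fayyad & Irani, 1993).
import numpy as np


def entropy(y):
    # Shannon entropy (base 2) of the class labels in y.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


def mdlp_accepts(y, y_left, y_right):
    n = len(y)
    k, k1, k2 = (len(np.unique(v)) for v in (y, y_left, y_right))
    gain = entropy(y) - (len(y_left) * entropy(y_left)
                         + len(y_right) * entropy(y_right)) / n
    delta = np.log2(3 ** k - 2) - (k * entropy(y)
                                   - k1 * entropy(y_left)
                                   - k2 * entropy(y_right))
    return gain > (np.log2(n - 1) + delta) / n
```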
I demonstrate how to use this feature in this gist. I also show that it produces approximately the same output as the corresponding R package, "discretization". I say "approximately" because there are some rows where the output differs; I looked into this and suspect the R package has roundoff errors. Also note that this feature allows the user to specify which columns to discretize, whereas the R package assumes every column is continuous.
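To make the intended usage concrete, the gist does roughly the following; the estimator name and the parameter for selecting columns are assumptions here, not the final API:

```python
# Usage sketch: discretize the continuous columns of the iris data, supervised
# by the class labels.
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# 'continuous_features' is a hypothetical name for choosing which columns to
# discretize; the R 'discretization' package instead treats every column as
# continuous.
est = MDLPDiscretizer(continuous_features=[0, 1, 2, 3])
X_binned = est.fit_transform(X, y)  # each selected column becomes interval codes
```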