-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG] Proxy improvement methods added to entropy/gini #5233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Can you add benchmark using covertype and mnist ? |
|
||
cdef double wl_log_wl = weighted_n_left * log(weighted_n_left) | ||
cdef double wr_log_wr = weighted_n_right * log(weighted_n_right) | ||
cdef double entropy = -wl_log_wl - wr_log_wr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the three previous line could be combine without loss of clarity.
Hm travis is not happy. I tried that briefly in #5220 and got beaten by exactly the same error. I believe that this is due to numerical instabilities and to approximations of big numbers by the "float system". |
I have check if we have the same numerical instabilities for regression, but it doesn't seem to be the case
PS: note that |
I dont have much time today to dig into this, but it is very important to make sure that variable importances converge to their true theoretical values. (We should add a test for that -- I'll do it.) In addition, variable importances should be invariant with respect to scaling the sample weights, as |
The more trees you pass in, the closer to the theoretical values you get, but you don't get them exactly. |
|
||
for k in range(n_outputs): | ||
gini_left = 0.0 | ||
gini_right = 0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strange, this should be remove. No?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes. That's weird that multi-output tests were successful with this.
I am not getting a significant speed up on covtypes or mnist, and there are other errors, so I am going to close this PR. |
Thanks for your hard work ! |
This pull request handles implementing explicit proxy improvement methods for entropy and gini criterion, to compliment the one added in #5203. Entropy is changed by refactoring the equation
-weighted_n_left / weighted_n_total * -sum( count_left_i / weighted_n_left * log( count_left_i / weighted_n_left ) - weighted_n_right / weighted_n_total * -sum( count_i / weighted_n_right * log(count_right_i / weighted_n_right )
into
sum( count_left_i * log( count_left_i ) ) + sum( count_right_i * log( count_right_i ) ) - weighted_n_left * log(weighted_n_left) - weighted_n_right * log(weighted_n_right)
which collapses some terms, and caches the calculation involving
weighted_n_left
andweighted_n_right
.The gini calculation time is collapsed into
Here are time tests. Entropy seems to take basically the same amount of time, whereas gini seems to run slightly faster. I was unsure if this was worth merging, so figured I'd see what you had to say.
@glouppe @arjoly @ogrisel