FIX poisson proxy_impurity_improvement #22191
Conversation
```
proxy_impurity_left -= self.sum_left[k] * log(y_mean_left)
proxy_impurity_right -= self.sum_right[k] * log(y_mean_right)
```
This is the fix.
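To see why these lines are a valid proxy, here is a minimal numpy sketch (an illustration, not the PR's Cython code; all names are local to the sketch). It checks that the fixed proxy, which keeps only `sum * log(y_mean)` per child, ranks candidate splits exactly like the full weighted half Poisson deviance, since the dropped terms are constant across splits:

```python
import numpy as np
from scipy.special import xlogy

rng = np.random.default_rng(0)
y = rng.poisson(lam=3.0, size=200).astype(float)

def half_poisson_deviance(y):
    # Node impurity: mean of y*log(y/y_mean) - y + y_mean, with 0*log(0) := 0.
    y_mean = y.mean()
    return np.mean(xlogy(y, y / y_mean) - y + y_mean)

def exact_objective(y, s):
    # What the default proxy_impurity_improvement maximizes:
    # -(weighted_n_left * impurity_left + weighted_n_right * impurity_right).
    left, right = y[:s], y[s:]
    return -(len(left) * half_poisson_deviance(left)
             + len(right) * half_poisson_deviance(right))

def poisson_proxy(y, s):
    # The fixed proxy (after the final sign flip in the Cython code):
    # sum_left * log(y_mean_left) + sum_right * log(y_mean_right).
    left, right = y[:s], y[s:]
    return left.sum() * np.log(left.mean()) + right.sum() * np.log(right.mean())

splits = np.arange(10, 190)
exact = np.array([exact_objective(y, s) for s in splits])
proxy = np.array([poisson_proxy(y, s) for s in splits])

# Both criteria pick the same best split; they differ only by a constant.
assert splits[exact.argmax()] == splits[proxy.argmax()]
np.testing.assert_allclose(exact - proxy, (exact - proxy)[0])
```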
@glemaitre @thomasjpfan Friendly ping.
Thanks for working on this PR @lorentzenchr!

I think this bug is very subtle. From my understanding, the proxy_impurity_improvement API expects the impurities to be "scaled back up":

scikit-learn/sklearn/tree/_criterion.pyx, lines 165 to 168 in 9816b35:

```
self.children_impurity(&impurity_left, &impurity_right)

return (- self.weighted_n_right * impurity_right
        - self.weighted_n_left * impurity_left)
```

For the Poisson case, the weighted_n_* gets canceled out. Is this your understanding of the bug as well?
Yes, the weights of the (candidate) left and right nodes get cancelled out in the lines scikit-learn/sklearn/tree/_criterion.pyx, lines 165 to 168 in 9816b35.

Example: impurity = MSE = `1/weighted_n * sum_i(w_i * (y_i - y_mean)^2)`. Multiplying by `weighted_n`, as those lines do, cancels the `1/weighted_n` factor and leaves the plain weighted sum of squares `sum_i(w_i * (y_i - y_mean)^2)`.
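A small numpy check of that cancellation (an illustration, not code from the PR): scaling the weighted MSE impurity back up by `weighted_n` recovers the plain weighted sum of squares, so the normalization drops out.

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(size=20)
w = rng.uniform(0.5, 2.0, size=20)  # sample weights

weighted_n = w.sum()
y_mean = np.average(y, weights=w)
impurity = np.average((y - y_mean) ** 2, weights=w)  # weighted MSE of the node

# weighted_n * impurity == sum_i(w_i * (y_i - y_mean)^2): the 1/weighted_n cancels.
np.testing.assert_allclose(weighted_n * impurity, np.sum(w * (y - y_mean) ** 2))
```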
LGTM
I also see a comment at the beginning of the class regarding the proxy computation. Do you want to remove it then?
For information, I tried to test the impact of this PR on the Poisson regression example: https://scikit-learn.org/stable/auto_examples/linear_model/plot_poisson_regression_non_normal_loss.html

I defined a Poisson random forest:

```python
from sklearn.ensemble import RandomForestRegressor

poisson_rf = Pipeline(
    [
        ("preprocessor", tree_preprocessor),
        (
            "regressor",
            RandomForestRegressor(criterion="poisson", min_samples_leaf=10, n_jobs=-1),
        ),
    ]
)
poisson_rf.fit(
    df_train, df_train["Frequency"], regressor__sample_weight=df_train["Exposure"]
)
print("Poisson Random Forest evaluation:")
score_estimator(poisson_rf, df_test)
```

and then computed the deviance when training on 10% of the data (due to longer training times of RF compared to linear models and GBDT) and I found that:
I find b a bit worrying, but maybe this is expected? Disclaimer: I did not run a full hyper-parameter optimization for the RF models. In particular, it's possible that the RF models would perform better with more trees (but those are slow...).

Edit: I made a mistake in my first batch of experiments where I forgot to recompile when switching branches... but after re-compiling, the previous comments still mostly hold. In any case, I made the following surprising observation with RFs: on this example, the following RF:

```python
RandomForestRegressor(
    criterion="poisson", n_estimators=5, min_samples_leaf=1000, n_jobs=-1
)
```

performs as well as the same RF with hundreds of trees, and much better than deep RFs with lower values for min_samples_leaf.

I wonder if simply averaging the y_hats in a RF is optimal for Poisson regression.
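For reference, a sketch of how that shallow RF plugs into the same pipeline (`tree_preprocessor`, `df_train`, `df_test`, and `score_estimator` come from the linked example, not from this PR):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

shallow_poisson_rf = Pipeline(
    [
        ("preprocessor", tree_preprocessor),
        (
            "regressor",
            RandomForestRegressor(
                criterion="poisson", n_estimators=5, min_samples_leaf=1000, n_jobs=-1
            ),
        ),
    ]
)
shallow_poisson_rf.fit(
    df_train, df_train["Frequency"], regressor__sample_weight=df_train["Exposure"]
)
print("Shallow Poisson Random Forest evaluation:")
score_estimator(shallow_poisson_rf, df_test)
```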
I confirm that this PR fixes the case originally reported as #22186 (comment).
The code looks good, thanks for the new comments.
Based on #22191 (comment), the original problem is fixed, +1 for merge.
```
- |Fix| Fix a bug in the Poisson splitting criterion for
  :class:`tree.DecisionTreeRegressor`.
  :pr:`22191` by :user:`Christian Lorentzen <lorentzenchr>`.
```
Oops, I realize that I merged too quickly: the sklearn.svm section has been split. Let me open a PR to fix this.
Reference Issues/PRs

Fixes #22186.

What does this implement/fix? Explain your changes.

This fixes proxy_impurity_improvement for the Poisson splitting criterion in DecisionTreeRegressor.

Any other comments?

Tests now pass with tighter bounds.
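As a rough illustration of the kind of check such tests can tighten (a minimal sketch under assumed parameters, not the actual test from this PR): with the fixed proxy, a tree grown with `criterion="poisson"` should reach a mean Poisson deviance at least as good as a squared-error tree on Poisson-distributed targets.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_poisson_deviance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.poisson(lam=np.exp(X[:, 0]))  # Poisson targets driven by the first feature
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for criterion in ("poisson", "squared_error"):
    tree = DecisionTreeRegressor(criterion=criterion, max_depth=4, random_state=0)
    tree.fit(X_train, y_train)
    # mean_poisson_deviance needs strictly positive predictions, so clip at a
    # small value (squared-error leaves can average to exactly 0).
    pred = np.clip(tree.predict(X_test), 1e-8, None)
    results[criterion] = mean_poisson_deviance(y_test, pred)

print(results)  # the "poisson" tree is expected to score lower (better)
```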