FIX poisson proxy_impurity_improvement #22191


Merged: 10 commits merged into scikit-learn:main on Jan 26, 2022

Conversation

lorentzenchr (Member)

Reference Issues/PRs

Fixes #22186.

What does this implement/fix? Explain your changes.

This fixes proxy_impurity_improvement for the Poisson splitting criterion in DecisionTreeRegressor.

Any other comments?

Tests now pass with tighter bounds.

@RAMitchell

Does the proxy_impurity_improvement return value need to be scaled by the number of instances? I'm not exactly sure how sklearn operates, but you might want to check that this is scaled correctly so that any 'min_impurity'-type checks are consistent with the loss function.

lorentzenchr added this to the 1.1 milestone on Jan 13, 2022
Comment on lines +1414 to +1415
proxy_impurity_left -= self.sum_left[k] * log(y_mean_left)
proxy_impurity_right -= self.sum_right[k] * log(y_mean_right)
lorentzenchr (Member Author)

This is the fix.
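
For readers of the thread: the reason the weighted_n_* factors can be dropped here is that the terms the proxy discards are constant across candidate splits. Below is a minimal NumPy sketch (illustrative function names, not the actual Cython criterion) checking that the fixed proxy ranks splits exactly like the full weighted Poisson impurity improvement:

    import numpy as np

    def half_poisson_deviance(y):
        # Node impurity for criterion="poisson": half Poisson deviance
        # with the node mean as the prediction.
        m = y.mean()
        return np.mean(y * np.log(y / m) - y + m)

    def full_improvement(y_left, y_right):
        # What proxy_impurity_improvement stands in for:
        # -(n_left * impurity_left + n_right * impurity_right).
        return -(y_left.size * half_poisson_deviance(y_left)
                 + y_right.size * half_poisson_deviance(y_right))

    def fixed_proxy(y_left, y_right):
        # The fix: per child, sum(y) * log(mean(y)). The -y + mean terms
        # cancel within each child, and sum(y * log(y)) does not depend
        # on the split, so both can be dropped.
        return (y_left.sum() * np.log(y_left.mean())
                + y_right.sum() * np.log(y_right.mean()))

    rng = np.random.default_rng(0)
    y = rng.poisson(3.0, size=200).astype(float) + 1.0  # shift avoids log(0)
    splits = range(10, y.size - 10)
    best_full = max(splits, key=lambda i: full_improvement(y[:i], y[i:]))
    best_proxy = max(splits, key=lambda i: fixed_proxy(y[:i], y[i:]))
    assert best_full == best_proxy  # both criteria pick the same split

Since the proxy differs from the full improvement only by an additive, split-independent constant, it is sufficient for ranking candidate splits.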

@lorentzenchr (Member Author)

@glemaitre @thomasjpfan Friendly ping.

@thomasjpfan (Member) left a comment

Thanks for working on this PR @lorentzenchr!

I think this bug is very subtle. From my understanding, the proxy_impurity_improvement API expects the impurities to be "scaled back up":

self.children_impurity(&impurity_left, &impurity_right)
return (- self.weighted_n_right * impurity_right
        - self.weighted_n_left * impurity_left)

For the Poisson case, the weighted_n_* factors get canceled out.

Is this your understanding of the bug as well?

@lorentzenchr (Member Author)

For the Poisson case, the weighted_n_* factors get canceled out.
Is this your understanding of the bug as well?

Yes, the weights of the (candidate) left and right nodes get cancelled out in the lines:

self.children_impurity(&impurity_left, &impurity_right)
return (- self.weighted_n_right * impurity_right
        - self.weighted_n_left * impurity_left)

Example: impurity = MSE = 1/n sum_i (y_i - y_pred_i)**2

  • For a single tree node, y_pred_i = const = y_mean = 1/n sum_i y_i, and therefore MSE = 1/n sum_i (y_i - y_mean)**2 = var(y).
  • If one divides the node into left and right, each side gets its own y_pred, so the MSE of the parent can be rewritten as MSE(parent) = 1/n * (n_L * MSE(left) + n_R * MSE(right)), which can then be simplified further. The point is that n_L * MSE(left) is a sum of squares instead of a mean of squares, i.e. the weighted_n_* factors in the proxy convert per-node means back into sums.
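
A quick numeric check of that last point (illustrative NumPy, not part of the PR):

    import numpy as np

    y_left = np.array([1.0, 2.0, 4.0])
    mse_left = np.mean((y_left - y_left.mean()) ** 2)  # mean of squares
    # Weighting by n_L turns the mean of squares back into a sum of squares:
    assert np.isclose(len(y_left) * mse_left,
                      np.sum((y_left - y_left.mean()) ** 2))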

@thomasjpfan (Member) left a comment

LGTM

@glemaitre (Member)

I also see a comment at the beginning of the class regarding the proxy computation. Do you want to remove it then?

@ogrisel (Member)

ogrisel commented Jan 26, 2022

For information, I tried to test the impact of this PR on the Poisson regression example: https://scikit-learn.org/stable/auto_examples/linear_model/plot_poisson_regression_non_normal_loss.html

I defined a `poisson_rf` model, similar to the `poisson_gbrt` pipeline but with:

from sklearn.ensemble import RandomForestRegressor
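# Pipeline, tree_preprocessor, score_estimator, df_train and df_test are
# all defined earlier in the linked example.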


poisson_rf = Pipeline(
    [
        ("preprocessor", tree_preprocessor),
        (
            "regressor",
            RandomForestRegressor(criterion="poisson", min_samples_leaf=10, n_jobs=-1),
        ),
    ]
)
poisson_rf.fit(
    df_train, df_train["Frequency"], regressor__sample_weight=df_train["Exposure"]
)

print("Poisson Random Forest evaluation:")
score_estimator(poisson_rf, df_test)

and then computed the deviance when training on 10% of the data (due to the longer training times of RFs compared to linear models and GBDTs), and I found that:

  • (a) this PR can improve the Poisson deviance a bit compared to main, but not by much (I am not sure whether the difference exceeds the statistical noise);
  • (b) Poisson RF cannot compete with Poisson GBRT or even linear Poisson regression on this data (or even Ridge regression), in terms of either Poisson deviance or Gini;
  • (c) MSE RF is not necessarily worse than Poisson RF in terms of Poisson deviance.

I find (b) a bit worrying, but maybe this is expected?

Disclaimer: I did not run a full hyper-parameter optimization for RF models. In particular, it's possible that the RF models would perform better with more trees (but those are slow...).

Edit: I made a mistake in my first batch of experiments: I forgot to recompile when switching branches... but after re-compiling, the previous comments still mostly hold.

In any case, I made a surprising observation with RFs on this example: the following RF:

    RandomForestRegressor(
        criterion="poisson", n_estimators=5, min_samples_leaf=1000, n_jobs=-1
    )

performs as well as the same RF with hundreds of trees, and much better than deep RFs with lower values of min_samples_leaf. With those hyperparameters, RFs are competitive with the linear models and GBRT.

I wonder if simply averaging the y_hats in an RF is optimal for Poisson regression.

@ogrisel (Member)

ogrisel commented Jan 26, 2022

I confirm that this PR fixes the case originally reported as #22186 (comment):

[screenshot confirming the fix]

@ogrisel (Member) left a comment

The code looks good, thanks for the new comments.

Based on #22191 (comment), the original problem is fixed, +1 for merge.

@ogrisel ogrisel merged commit 2b15b90 into scikit-learn:main Jan 26, 2022
- |Fix| Fix a bug in the Poisson splitting criterion for
:class:`tree.DecisionTreeRegressor`.
:pr:`22191` by :user:`Christian Lorentzen <lorentzenchr>`.


Oops, I realize that I merged too quickly: the sklearn.svm section has been split. Let me open a PR to fix this.

Successfully merging this pull request may close: Incorrect Poisson objective for decision tree/random forest (#22186)