Tweedie deviance loss for tree based models #16668
Comments
I would start with the Gradient Boosting models: they already have the notion of link functions tied to their loss. This is necessary for Bernoulli / logit expected values in binary classification and categorical / softmax expected values in multiclass classification, for instance. At the moment the link function is hard-coded for each loss, and I think this is fine as a first step. +1 for using the log link for Tweedie / Poisson / Gamma by default.
For pure tree-based and RF models, we will indeed need something more hackish, but I would focus on gradient boosting first.
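To make the link-function point above concrete, here is a minimal NumPy sketch (illustrative only, not scikit-learn internals): with a log link, the raw ensemble output can be any real number, while the predicted expected value stays strictly positive, as the Poisson/Gamma/Tweedie deviances require.

```python
import numpy as np
from scipy.special import xlogy

# Raw ensemble output on the link (log) scale; may be any real number.
raw_prediction = np.array([-2.0, 0.0, 1.5])

# Log link: the predicted expected value exp(raw) is always strictly positive.
y_pred = np.exp(raw_prediction)

# Half Poisson deviance per sample; xlogy handles the y == 0 case gracefully.
y_true = np.array([0.0, 1.0, 3.0])
half_dev = xlogy(y_true, y_true / y_pred) - y_true + y_pred
print(y_pred, half_dev.mean())
```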
@ogrisel Fast random forests are also on my personal wish list, as they provide an excellent no-brainer baseline model. As for:

> Let's keep things simple for now and not introduce the power parameter (we don't have that kind of flexible loss API yet)

Yes please!!
Compound Poisson-Gamma (power strictly in (1, 2)) is nice too because it allows a mixture of a point mass at zero and a continuous distribution on R+, which is not possible to model with either Poisson or Gamma alone (plus the optimal variance function is not necessarily linear or quadratic). I think we could have an extra `power` constructor param (set to `None` by default) that would only be used when `loss="tweedie_deviance"`?
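(Side note, not part of the proposal above: the `power` parameter already exists on the metric side as `sklearn.metrics.mean_tweedie_deviance`, so a quick sketch can show how it interpolates between Poisson at `power=1` and Gamma at `power=2`; zeros in `y_true` are allowed for `1 <= power < 2`.)

```python
import numpy as np
from sklearn.metrics import mean_tweedie_deviance

y_true = np.array([0.0, 1.0, 2.0, 5.0])   # zeros allowed for 1 <= power < 2
y_pred = np.array([0.5, 1.2, 1.8, 4.0])   # predictions must stay strictly positive

for power in [1.0, 1.5, 1.9]:
    print(power, mean_tweedie_deviance(y_true, y_pred, power=power))
```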
I agree compound Poisson-Gamma is useful, but I was hoping that the first PR would contain only non-controversial content (i.e. no API change), to avoid long discussions like for the GLMs, at least for now.
Alright.
You mean making scikit-learn RF faster by implementing histogram-based splits? Or having a fast RF implementation with Poisson/Gamma/Tweedie response variables by adding such criteria? Nowadays, whenever I tried HistGradientBoostingClassifier/Regressor vs RandomForestClassifier/Regressor, both with default hyperparameters, the former was always the faster one, with better predictive performance. Hyper-parameter tuning often improves things a bit but is rarely critical. So to me, HistGradientBoostingClassifier/Regressor is the new no-brainer baseline.
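For what it's worth, that comparison is easy to reproduce with a rough sketch like the one below (synthetic data, default hyperparameters; actual timings and scores depend on the dataset):

```python
from time import perf_counter

from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=20_000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (HistGradientBoostingRegressor(), RandomForestRegressor()):
    tic = perf_counter()
    model.fit(X_train, y_train)
    fit_time = perf_counter() - tic
    print(type(model).__name__, f"fit: {fit_time:.1f}s", f"R^2: {model.score(X_test, y_test):.3f}")
```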
@ogrisel Can you read minds? Around the same time you answered yesterday, a friend of mine came up with the idea of a histogram random forest 😄 That would be great to have, but any RF with a Poisson splitting criterion will do. So far, they are hard to find in the ecosystem.
@ogrisel @lorentzenchr Can I work on the other cases, or is someone already doing so?
Describe the workflow you want to enable
If the target `y` is (approximately) Poisson, Gamma or otherwise Tweedie distributed, it would be beneficial for tree-based regressors to support Tweedie deviance loss functions as splitting criterion. This partially addresses #5975.
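For illustration, such a target can be simulated as a compound Poisson-Gamma variable (a Tweedie distribution with `1 < power < 2`); the parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# Compound Poisson-Gamma: y is a sum of N Gamma draws, with N ~ Poisson.
# This yields an exact point mass at zero plus a continuous part on (0, inf).
n_claims = rng.poisson(lam=0.8, size=n_samples)
y = np.array([rng.gamma(shape=2.0, scale=50.0, size=n).sum() for n in n_claims])

print("fraction of exact zeros:", (y == 0).mean())
```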
Describe your proposed solution
Ideally, one first implements … and then adds the different loss criteria to the tree-based models (a usage sketch follows the list below):
- `DecisionTreeRegressor` (Poisson only): [MRG] ENH add Poisson splitting criterion for single trees #17386
- `RandomForestRegressor` (Poisson only): ENH Adds Poisson criterion in RandomForestRegressor #19304, #19836
- `GradientBoostingRegressor`
- `HistGradientBoostingRegressor` (Poisson and Gamma, but no other Tweedie cases): ENH Poisson loss for HistGradientBoostingRegressor #16692
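As a usage sketch, and assuming the Poisson criterion/loss from the PRs linked above is available in the installed scikit-learn version (the Gamma and other Tweedie cases remain open), fitting could look like this:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 5))
y = rng.poisson(lam=np.exp(0.3 * X[:, 0]))  # count target with a log-linear mean

# Poisson splitting criterion / loss; Gamma and other Tweedie powers are the
# part of this issue that is still open.
tree = DecisionTreeRegressor(criterion="poisson", max_depth=4).fit(X, y)
forest = RandomForestRegressor(criterion="poisson", n_estimators=50).fit(X, y)
hgb = HistGradientBoostingRegressor(loss="poisson").fit(X, y)
```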
Open for Discussion
For Poisson and Tweedie deviance with `1 <= power < 2`, the target `y` may be zero while the prediction `y_pred` must be strictly larger than zero. A tree might find a split where one node has `y = 0` for all samples in that node, resulting naively in `y_pred = mean(y) = 0` for that node. I see 3 different solutions to that:

1. `y_pred = np.exp(tree)`. See ENH Poisson loss for HistGradientBoostingRegressor #16692 for HistGradientBoostingRegressor. This may be no option for DecisionTreeRegressor.
2. Forbid splits that result in a node with `sum(y) = 0`. One might also introduce some option like `min_y_weight`, such that splits with `sum(sample_weight * y) < min_y_weight` are forbidden.
3. Set `y_pred = a * mean(y) + (1 - a) * y_pred_parent` and forbid further splits, see [1]. (Bayes/credibility theory motivates setting `a = sum(sample_weight * y) / (gamma + sum(sample_weight * y))` for some hyperparameter `gamma`; a numeric sketch follows below.)

There is also a dirty solution that allows `y_pred = 0` but uses the value `max(eps, y_pred)` in the loss function for some tiny value of `eps`.
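A tiny numeric sketch of the third option (shrinkage toward the parent node's prediction; the value of `gamma` is arbitrary): with `sum(sample_weight * y) == 0`, the leaf prediction falls back entirely to the strictly positive parent prediction.

```python
import numpy as np

y = np.zeros(8)                      # leaf where all targets are zero
sample_weight = np.ones_like(y)
y_pred_parent = 0.7                  # strictly positive prediction of the parent node
gamma = 1.0                          # credibility hyperparameter (arbitrary value)

a = np.sum(sample_weight * y) / (gamma + np.sum(sample_weight * y))
y_pred = a * np.mean(y) + (1 - a) * y_pred_parent
print(y_pred)  # 0.7: shrunk fully toward the parent because sum(w * y) == 0
```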
References
[1] R rpart library, chapter 8 Poisson regression