Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Conversation

@NicolasHug
Copy link
Member

In the tree Splitter we call impurity_improvement() and then children_impurity().

But impurity_improvement() itself will call children_impurity() internally so this PR removes the duplicated work.

CC @lorentzenchr @thomasjpfan

cdef double impurity_left
cdef double impurity_right

self.children_impurity(&impurity_left, &impurity_right)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the call that was removed.

Copy link
Member

@alfaro96 alfaro96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @NicolasHug!

I think that with this PR, we can avoid duplicate calculations.

Comment on lines 437 to +440
self.criterion.children_impurity(&best.impurity_left,
&best.impurity_right)
best.improvement = self.criterion.impurity_improvement(
impurity, best.impurity_left, best.impurity_right)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, moving the call to children_impurity before impurity_improvement allow to pass the impurity of the children. Therefore, we avoid these duplicate calculations:

cdef double impurity_left
cdef double impurity_right
self.children_impurity(&impurity_left, &impurity_right)

Am I right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM so!

Co-authored-by: Juan Carlos Alfaro Jiménez <[email protected]>

return diff * diff / (self.weighted_n_left * self.weighted_n_right)

cdef double impurity_improvement(self, double impurity) nogil:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for this change here?
Are we calling only the proxy_impurity_improvement?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uhm it seems that with this change we will use the MSE impurity improvement

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks. I didn't realize Friedman MSE used a non-conventional improvement. Which makes me going through a rabbit hole, wondering whether this makes sense at all, but that's irrelevant for this PR. I put it back

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went down the same rabbit hole. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how did you get out? I'm still super confused about so many things. Like does friedman_mse make sense outside of GBDTs, and do we really want to allow a MAE splitting criteria when we already have the LAD loss... so many questions lol

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not planning to get out anytime soon. There isn't many references of this criterion outside of https://statweb.stanford.edu/~jhf/ftp/trebst.pdf

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... Here are my notes so far. I'll open an issue when I have a better idea of this all, but I'm happy to sync with you prior!

Does it really make sense to allow a criterion to be passed to GBDT? All trees
should be built via LS anyway. Friedman does mention e.g. for LAD that trees
could be built with LAD criterion but LS is just much faster.

WTF is friedman_mse?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥕 ?
@NicolasHug I was thinking/doubting about friedman_mse myself and would appreciate an issue if you come up with one.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


return diff * diff / (self.weighted_n_left * self.weighted_n_right)

cdef double impurity_improvement(self, double impurity) nogil:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went down the same rabbit hole. :)

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @NicolasHug Nice catch!

@lorentzenchr
Copy link
Member

@NicolasHug Would you like a what's new entry? It's supposedly a slight performance improvement.

@glemaitre
Copy link
Member

@NicolasHug Would you like a what's new entry? It's supposedly a slight performance improvement.

I think we are fine merging as it is.

@NicolasHug Feel free to open an issue on the Friedman MSE. They are some relics in the code that only some old wizard knows about :) (even git blame does not help there)

@glemaitre glemaitre merged commit 22f232e into scikit-learn:master Aug 21, 2020
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants