[MRG+2] Add class_weight support to the forests & trees #3961
Conversation
'for multi-output. You may use '
'sample_weights in the fit method '
'to weight by sample.')
Further to my comment about multiplying the per-output weights together for multi-output, this is what I was thinking for replacing #L403-412:
cw = []
for k in range(self.n_outputs_):
    # per-class weights for output k (preset or user-supplied dict)
    cw_part = compute_class_weight(self.class_weight,
                                   self.classes_[k],
                                   y_org[:, k])
    # map each sample to the weight of its class for this output
    cw_part = cw_part[np.searchsorted(self.classes_[k],
                                      y_org[:, k])]
    cw.append(cw_part)
# multiply the per-output sample weights together
cw = np.prod(cw, axis=0, dtype=np.float64)
Another option would be to perform this action at the bootstrap level for 'auto' class_weights in the _parallel_build_trees method. However, this would make 'auto' act differently to user-defined class_weights, as the user-defined dict would be applied the same way regardless of the bootstrap sample.
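For illustration, a minimal sketch (not the PR's actual code) of what that bootstrap-level alternative could look like; the helper name, the indices argument and the 'balanced' preset (spelled 'auto' at the time of this PR) are assumptions:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

def bootstrap_class_weights(y, classes, indices):
    # weights computed on the bootstrap draw only, so each tree sees
    # weights matched to the class balance of its own sample
    y_boot = y[indices]
    weights = compute_class_weight(class_weight='balanced',
                                   classes=classes, y=y_boot)
    # classes is assumed sorted (as returned by np.unique); map each
    # bootstrapped sample to the weight of its class
    return weights[np.searchsorted(classes, y_boot)]

One practical wrinkle of this per-tree approach is that a rare class can be entirely absent from a given bootstrap draw, which the weighting then has to handle somehow (the sketch above would simply fail in that case).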
Added support for multi-output through the presets or use of a list of dicts for user-defined weights, i.e. one dict per output. In deciding whether to implement the "auto" weights at the full dataset level, or in the bootstrapping, I added another option for the user to enter 'bootstrap', which computes the weights on each bootstrap sample. These presets, as with the user-defined dicts, are multiplied together for multi-output in this implementation.
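To make the list-of-dicts form concrete, a hypothetical two-output example (the dict values are made up; the per-output weights are multiplied together for each sample):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = rng.randint(0, 2, size=(100, 2))  # two binary outputs

# one dict per output; a sample labelled (1, 1) effectively gets weight 5 * 2 = 10
clf = RandomForestClassifier(class_weight=[{0: 1, 1: 5}, {0: 1, 1: 2}],
                             random_state=0).fit(X, y)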
Anyone have any thoughts on this so far?
@@ -89,6 +87,27 @@ def _parallel_build_trees(tree, forest, X, y, sample_weight, tree_idx, n_trees,
         sample_counts = np.bincount(indices, minlength=n_samples)
         curr_sample_weight *= sample_counts

+        if class_weight == 'bootstrap':
Shouldn't this be implemented in the tree? If that gets the auto class weight it can compute it itself, right?
Thanks for your contribution.
I will check it out. The "auto" and user-defined weights could perhaps be done in the tree, but that would require re-computing the same weights for every tree, which seems like redundant work. I'll undertake a bit of an investigation all the same.
Hi Trevor! First of all, sorry for the lack of feedback on our side. It seems like all tree-huggers are quite busy with other things. Regarding the implementation of class weights, there are basically two different approaches, as I explained some time ago in #2129: the approach that you chose, exploiting sample weights only, or the one using class weights as priors in the computation of the impurity criteria. It should be checked, but I am not sure both are exactly equivalent. Anyhow, the sample-weight based implementation is certainly the simplest to implement. So for the record, I am +1 for implementing this algorithm rather than modifying all criteria.
I agree. In general, whatever option we have in forests, we also have in single trees. This should be the case as well for this feature, I believe. Note that from an implementation point of view, redundant work can be avoided easily, e.g. by forcing class_weight=None from the forest in the trees that are built, but passing in the correctly balanced sample weights.
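A rough sketch of that idea, using today's compute_sample_weight utility (the build_tree helper and tree_params are invented for illustration): the forest expands class_weight into per-sample weights once, and each tree is then fit with class_weight=None.

from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_sample_weight

def build_tree(X, y, class_weight, **tree_params):
    # expand class weights into per-sample weights once, at the forest level
    expanded = compute_sample_weight(class_weight, y)
    # the tree itself gets class_weight=None, so nothing is recomputed per tree
    tree = DecisionTreeClassifier(class_weight=None, **tree_params)
    return tree.fit(X, y, sample_weight=expanded)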
Hey Gilles, thanks for the input, and no worries on delays; I understand that everyone here is pretty heavily loaded. I'm going to take a look through the discussion in #2129. I think we're on the same page, but the redundant work I mentioned was just that the same weights would be re-computed for every tree. Anyhow, I'll probably check back in after my lovely 20-hour commute back home to Australia this evening :)
Added …
So I think keeping the …
Great, thanks @glouppe! I'm working on validation of the class_weight input …
I have no strong opinion on this; at the very least we should mention what happens in the docstring. Consistency with SGDClassifier is a good thing though.
Is this still a work in progress? I can start reviewing the PR if you feel this is more or less ready.
That would be great @glouppe, thanks. Added tests for the class_weight handling in tree and forest. I warn for the string presets and …
@@ -737,6 +737,8 @@ def check_class_weight_classifiers(name, Classifier):
     classifier = Classifier(class_weight=class_weight)
     if hasattr(classifier, "n_iter"):
         classifier.set_params(n_iter=100)
+    if hasattr(classifier, "min_weight_fraction_leaf"):
+        classifier.set_params(min_weight_fraction_leaf=0.01)
Why this?
Maybe the test could also be reworked. The test passed for RBF SVMs, too, I think.
I wrote some of the class_weight tests, but I feel these are very hard to do. There are some examples with carefully crafted datasets; maybe these should be used instead?
Actually, this test seems to be fine to me. As long as there is any regularization, it should work. I guess the bootstrapping alone doesn't help enough as there are very few trees in the ensemble by default.
Yup, there are also only 2 features in this dataset, so the randomized feature selection doesn't have much of a chance to generalize. Happy to make any mods that are deemed necessary though. Admittedly this is a bit of a hack to get a passing grade from Travis, though the modules do get a harder look through my new Iris tests in tree and forest.
I think your fix is ok.
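For reference, a sketch of the kind of carefully crafted check mentioned above; the dataset shape, the 'balanced' preset (called 'auto' at the time of this PR) and the use of recall are illustrative choices, not the project's actual common test:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# imbalanced binary problem: roughly 90% of the samples belong to class 0
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
weighted = RandomForestClassifier(class_weight='balanced',
                                  random_state=0).fit(X_tr, y_tr)

# the weighted forest should typically recover more of the rare class
print(recall_score(y_te, plain.predict(X_te)),
      recall_score(y_te, weighted.predict(X_te)))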
@@ -377,8 +406,9 @@ def _set_oob_score(self, X, y):

         self.oob_score_ = oob_score / self.n_outputs_

-    def _validate_y(self, y):
-        y = np.copy(y)
+    def _validate_y_cw(self, y_org):
Can you be explicit and call the variable y_original?
Besides the small cosmit, I am +1 for merge. Tests are thorough. Could you also add an entry in the whatsnew file? Thanks for your work @trevorstephens! This is a very helpful addition :) (and sorry for the lack of responsiveness these days...)
Thanks @glouppe! Requested changes are made. Second reviewer, anyone (while I figure out how to get around a lovely 11th-hour merge conflict)?
I was not concerned about performance, but about code duplication.
Ah. I made a few comments about this a couple of weeks back: #3961 (comment) TL;DR: It may be possible, but would result in identical calculations for every tree, where this way we only do it once.
Never mind then, 👍 from me.
Great, thanks! Any way to kick off Travis again?
I restarted the failed job.
@GaelVaroquaux thanks!
@@ -211,11 +232,17 @@ def fit(self, X, y, sample_weight=None):

         self.n_outputs_ = y.shape[1]

-        y = self._validate_y(y)
+        y, cw = self._validate_y_cw(y)
Could we have a more explicit name for what "cw" means (for the variable name and the function call)?
@GaelVaroquaux, thanks for looking it over! Let me know what you think of the latest.
@amueller @glouppe @GaelVaroquaux & others: I've been thinking about renaming the 'bootstrap' preset. Sorry to bring this up after flipping from [WIP], but it seems important to decide on before merge given the other estimators in the module.
@trevorstephens: that's a very good comment. Maybe I would use 'subsample', which I find is less confusing than 'estimator'.
         if getattr(y, "dtype", None) != DOUBLE or not y.flags.contiguous:
             y = np.ascontiguousarray(y, dtype=DOUBLE)

         if expanded_class_weight is not None:
             if sample_weight is not None:
                 sample_weight = np.copy(sample_weight) * expanded_class_weight
The 'np.copy' doesn't seem necessary above: the multiplication already returns a new array, so the original sample_weight is not modified in place.
Conflicts: doc/whats_new.rst sklearn/ensemble/forest.py sklearn/tree/tree.py
Thanks for the comments @GaelVaroquaux. Renamed the cw variables to the more explicit expanded_class_weight.
LGTM. Two 👍 and my review. I think that this can be merged. Merging. Thanks!
[MRG+2] Add class_weight support to the forests & trees
Thanks! Cheers!
This PR adds support for specifying class_weight in the forest classifier constructors. One open question: how should class_weight behave when used with the warm_start option? Any other comments?
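For context, a minimal usage sketch of what this PR enables; the weight values are arbitrary, and the preset is spelled 'balanced' in current scikit-learn releases (this PR originally introduced it as 'auto', plus the per-bootstrap variant discussed above):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# explicit per-class weights on a single tree
tree = DecisionTreeClassifier(class_weight={0: 1, 1: 2, 2: 5}).fit(X, y)

# preset: weights inversely proportional to class frequencies
forest = RandomForestClassifier(class_weight='balanced', random_state=0).fit(X, y)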