
[MRG+2] Add class_weight support to the forests & trees #3961


Merged · 10 commits merged into scikit-learn:master on Jan 13, 2015

Conversation

trevorstephens
Contributor

This PR adds support for specifying class_weight in the forest classifier constructors.

  • Right now it only supports single-output classification problems. It would be possible, however, to accept a list of dicts (one per target), or just the preset string 'auto' overall, and multiply the expanded weight vectors so that a sample with two minority classes becomes even more important, and so on (a rough sketch of this expansion follows this list). I couldn't find any precedent in other classifiers to guide this.
  • Should an exception, warning or tantrum be thrown for an 'auto' class_weight used with the warm_start option?
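For illustration, a minimal sketch (not the PR's code) of the expansion idea from the first bullet: each output's class_weight dict is expanded into per-sample weights, and the resulting vectors are multiplied across outputs.

import numpy as np

# Hypothetical data: four samples, two outputs.
y = np.array([[-1, 0],
              [ 1, 2],
              [ 1, 2],
              [-1, 1]])
class_weight = [{-1: 0.5, 1: 1.0},          # weights for output 0
                {0: 1.0, 1: 1.0, 2: 2.0}]   # weights for output 1

# Expand each output's dict into per-sample weights, then multiply across
# outputs so a sample that is weighted up in both outputs gets an even
# larger combined weight.
expanded = np.ones(y.shape[0])
for k, cw in enumerate(class_weight):
    expanded *= np.array([cw[label] for label in y[:, k]])

print(expanded)  # [0.5 2.  2.  0.5]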

Any other comments?

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling de35bea on trevorstephens:rf-class_weight into 6dab7c5 on scikit-learn:master.

@trevorstephens
Contributor Author

Referencing #2129 and #64

'for multi-output. You may use '
'sample_weights in the fit method '
'to weight by sample.')

Contributor Author

Further to my comment about multiplying each output's weights for multi-output, this is what I was thinking for replacing #L403-412:

cw = []
for k in range(self.n_outputs_):
    # Per-class weights for output k
    cw_part = compute_class_weight(self.class_weight,
                                   self.classes_[k],
                                   y_org[:, k])
    # Expand to per-sample weights by indexing with each sample's class
    cw_part = cw_part[np.searchsorted(self.classes_[k],
                                      y_org[:, k])]
    cw.append(cw_part)
# Combine outputs by multiplying the per-sample weight vectors
cw = np.prod(cw, axis=0, dtype=np.float64)

Another option would be to perform this action at the bootstrap level for 'auto' class_weights in the _parallel_build_trees method. However, this would make 'auto' act differently to user-defined class_weights, as the user-defined dict would be applied the same way regardless of the bootstrap sample.

@trevorstephens
Contributor Author

Added support for multi-output through the presets, or the use of a list of dicts for user-defined weights, e.g. [{-1: 0.5, 1: 1.}, {0: 1., 1: 1., 2: 2.}].

In deciding whether to implement the "auto" weights at the full-dataset level or at the bootstrap level, I added another option, class_weight="bootstrap", which does the weighting based on the make-up of each bootstrap sample, while the "auto" option bases weights on the full dataset (and thus saves a bit of time by only doing it once).

These presets, as with the user-defined dicts, are multiplied together for multi-output in this implementation.
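A usage sketch of the options as named at this point in the PR (note that the 'bootstrap' preset is renamed to 'subsample' later in the thread):

from sklearn.ensemble import RandomForestClassifier

# User-defined weights for a two-output problem, one dict per output.
clf = RandomForestClassifier(class_weight=[{-1: 0.5, 1: 1.},
                                           {0: 1., 1: 1., 2: 2.}])

# Presets: 'auto' balances on the full dataset (computed once);
# 'bootstrap' re-balances on each tree's bootstrap sample.
clf_auto = RandomForestClassifier(class_weight='auto')
clf_boot = RandomForestClassifier(class_weight='bootstrap')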

@trevorstephens
Contributor Author

Anyone have any thoughts on this so far?

@@ -89,6 +87,27 @@ def _parallel_build_trees(tree, forest, X, y, sample_weight, tree_idx, n_trees,
sample_counts = np.bincount(indices, minlength=n_samples)
curr_sample_weight *= sample_counts

if class_weight == 'bootstrap':
Member

Shouldn't this be implemented in the tree? If that gets the auto class weight it can compute it itself, right?

@amueller
Member

Thanks for your contribution.
I think as much as possible should be implemented in the DecisionTreeClassifier/Regressor. I think the only case where you need to do something in the forest is 'auto' where you want to use the whole training set, right?

@trevorstephens
Contributor Author

I will check it out, but I think that the tree will not be able to see the bootstrap sample directly: the over/under-sampling is done by adjusting the count of each sample, simply changing its weight to +2, +1, 0, etc., and this is multiplied by sample_weight, which could mask some of the counts if computed at the tree level. The entire, un-bootstrapped y array (transformed away from the original labels) is still passed to the tree.

The "auto" and user-defined weights could be done in the tree perhaps, but that would require re-computing the same weights for every tree, which seems like redundant work.

I'll undertake a bit of an investigation all the same.
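For context, a rough sketch (not the PR's code) of what the paragraph above describes: the bootstrap sample is encoded as per-sample counts via np.bincount (as in the diff excerpt earlier) and multiplied by sample_weight, after which the bootstrap indices are no longer recoverable inside the tree.

import numpy as np

rng = np.random.RandomState(0)
n_samples = 6
indices = rng.randint(0, n_samples, n_samples)             # bootstrap draw
sample_counts = np.bincount(indices, minlength=n_samples)  # 0, 1, 2, ... per sample

sample_weight = np.array([1., 0.5, 1., 2., 1., 1.])        # user-supplied weights
curr_sample_weight = sample_weight * sample_counts         # what the tree receives

print(sample_counts)       # how often each sample was drawn
print(curr_sample_weight)  # counts and weights are now entangled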

@glouppe
Contributor

glouppe commented Dec 19, 2014

Hi Trevor! First of all, sorry for the lack of feedback on our side. It seems like all tree-huggers are quite busy with other things.

Regarding the implementation of class weights, there are basically two different approaches, as I explained some time ago in #2129: the approach that you chose, exploiting sample weights only, or the one using class weights as priors in the computation of the impurity criteria. It should be checked, but I am not sure both are exactly equivalent. Anyhow, the sample-weight-based implementation is certainly the simplest to implement. So for the record, I am +1 for implementing this algorithm rather than modifying all the criteria.
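For readers unfamiliar with the second approach, a minimal sketch (not scikit-learn code) of class weights acting as priors in an impurity computation, using Gini as an example:

import numpy as np

def weighted_gini(y, class_weight):
    # Class weights re-weight the class proportions before computing impurity.
    classes, counts = np.unique(y, return_counts=True)
    w = np.array([class_weight[c] for c in classes])
    p = w * counts
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

print(weighted_gini(np.array([0, 0, 0, 1]), {0: 1.0, 1: 3.0}))  # 0.5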

@glouppe
Contributor

glouppe commented Dec 19, 2014

> Thanks for your contribution.
> I think as much as possible should be implemented in the DecisionTreeClassifier/Regressor. I think the only case where you need to do something in the forest is 'auto' where you want to use the whole training set, right?

I agree. In general, whatever option we have in forests, we also have in single trees. This should be the case as well for this feature I believe.

Note that, from an implementation point of view, the redundant work can easily be avoided, e.g. by having the forest force class_weight=None in the trees that are built while passing the correctly balanced sample weights.
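A minimal sketch (hypothetical, not the PR's code) of the pattern described above: the ensemble expands the class weights into per-sample weights once, and each tree is then built with class_weight=None and the pre-balanced sample_weight.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [0.1], [0.2], [1.0]])
y = np.array([0, 0, 0, 1])
class_weight = {0: 1.0, 1: 3.0}

# Computed once at the ensemble level...
expanded_weights = np.array([class_weight[label] for label in y])

# ...then passed to every tree, each of which gets class_weight=None.
trees = [DecisionTreeClassifier(class_weight=None, random_state=i).fit(
             X, y, sample_weight=expanded_weights)
         for i in range(3)]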

@trevorstephens
Contributor Author

Hey Gilles, thanks for the input, and no worries about the delays; I understand that everyone here is pretty heavily loaded.

I'm going to take a look through the tree code this weekend and see how class_weight could be implemented there as well, and how/if it can interact nicely with the forests.

I think we're on the same page, but the redundant work I mentioned was just that the class_weight computed across all samples ('auto' and user-defined modes) will be the same for every tree, so handing class_weight to the trees would repeat the same calculation for every estimator in the ensemble, which seems like unnecessary overhead, though it could perhaps be a 'nicer' paradigm. I certainly agree that class_weight should be implemented for DecisionTreeClassifier, whether or not it is used by the forests.

Anyhow, I'll probably check back in after my lovely 20-hour commute back home to Australia this evening :)

@trevorstephens trevorstephens changed the title [WIP] Add support for class_weight to the forests [WIP] Add class_weight support to the forests & trees Dec 23, 2014
@trevorstephens
Contributor Author

Added class_weight to DecisionTreeClassifier, using much the same code, except without the 'bootstrap' option. In terms of farming out the class_weight calculation to the trees, though, I see a few issues:

  • User-defined weights are represented as a dict of labels and weights, but these labels are re-encoded (to 0, 1, 2, ...) as they are passed into the tree estimators, so the dict would have to be transformed to match the new labels if it is to be understood by the individual trees (see the sketch after this comment). Additionally, the calculation is the same for every estimator, so performing it once at the ensemble level saves a touch of overhead in building each tree.
  • 'auto' weights could be passed to the trees, but as with the user-defined weights, these would be the same for each tree, so computing them at the ensemble level avoids repeating the same calculation.
  • 'bootstrap' weights are computed for each tree, but the bootstrap sample is represented as weights on the original y(s), which are multiplied by any sample_weight passed to the fit method. Thus, it could be hard to unravel what the bootstrap indices are once inside the tree.

So I think keeping the class_weight calcs at the ensemble level and passing it to the individual trees as a re-balanced sample_weight makes more sense for all these cases. Interested to know what others think about the implementation though.
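To illustrate the relabelling issue from the first bullet (a hypothetical sketch, not the PR's code):

import numpy as np

# Inside the ensemble, class labels are re-encoded to 0, 1, 2, ..., so a dict
# keyed by the original labels would need the same translation before an
# individual tree could interpret it.
y_original = np.array([-1, 1, 1, -1])
classes, y_encoded = np.unique(y_original, return_inverse=True)
print(classes)    # [-1  1]
print(y_encoded)  # [0 1 1 0]

user_weights = {-1: 0.5, 1: 1.0}
translated = {int(np.searchsorted(classes, label)): w
              for label, w in user_weights.items()}
print(translated)  # {0: 0.5, 1: 1.0}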

@glouppe
Contributor

glouppe commented Dec 27, 2014

> So I think keeping the class_weight calcs at the ensemble level and passing it to the individual trees as a re-balanced sample_weight makes more sense for all these cases. Interested to know what others think about the implementation though.

This is what I had in mind. +1 for implementing it this way.

@trevorstephens
Contributor Author

Great, thanks @glouppe

I'm working on validation of the class_weight parameter to raise more informative errors when appropriate. What are your thoughts on the warm_start option? I've noticed it's possible to pass a new X for additional estimators when using this option. BaseSGDClassifier raises a ValueError here for its partial_fit method. Should an error or warning be issued for warm_start when using the 'auto' or 'bootstrap' presets in the forests?

@glouppe
Contributor

glouppe commented Dec 28, 2014

> Should an error or warning be issued for warm_start when using the 'auto' or 'bootstrap' presets in the forests?

I have no strong opinion on this; at the very least we should mention what happens in the docstring. Consistency with SGDClassifier is a good thing, though.

@glouppe
Contributor

glouppe commented Jan 3, 2015

Is this still a work in progress? I can start reviewing the PR if you feel this is more or less ready.

@trevorstephens trevorstephens changed the title [WIP] Add class_weight support to the forests & trees [MRG] Add class_weight support to the forests & trees Jan 3, 2015
@trevorstephens
Contributor Author

That would be great @glouppe , thanks.

Added tests for the class_weight parameter that raise errors when warranted. This is mostly for multi-output and valid strings as the compute_class_weight module does a number of checks that I am currently deferring to. I can add explicit tests for these cases if you think it is necessary.

I warn for the string presets combined with warm_start, as it appears this has a slightly different usage from SGDClassifier, judging from the relevant discussions in the original warm_start issues and PR.

@@ -737,6 +737,8 @@ def check_class_weight_classifiers(name, Classifier):
    classifier = Classifier(class_weight=class_weight)
    if hasattr(classifier, "n_iter"):
        classifier.set_params(n_iter=100)
    if hasattr(classifier, "min_weight_fraction_leaf"):
        classifier.set_params(min_weight_fraction_leaf=0.01)
Contributor

Why this?

Contributor Author

[image attachment]

It's an extremely noisy dataset that the trees seem to have a hard time fitting, as the test, I believe, was written with linear models in mind. Setting min_weight_fraction_leaf was a way to force the trees to learn a very small model. I could also get it to pass by forcing a decision stump here with max_depth=1, for instance.

Member

Maybe the test could also be reworked. The test passed for rbf svms too, I think.
I wrote some of the class_weight tests, but I feel these are very hard to do. There are some examples with carefully crafted datasets; maybe these should be used instead?

Member

Actually, this test seems to be fine to me. As long as there is any regularization, it should work. I guess the bootstrapping alone doesn't help enough as there are very few trees in the ensemble by default.

Contributor Author

Yup, there are also only 2 features in this dataset, so the randomized feature selection doesn't have much of a chance to help it generalize. Happy to make any mods that are deemed necessary, though. I admit this is a bit of a hack to get a passing grade from Travis, though the modules do get exercised a bit harder by my new Iris tests in tree and forest.

Member

I think your fix is ok.

@trevorstephens
Contributor Author

Any other comments on the PR @amueller & @glouppe ?

@@ -377,8 +406,9 @@ def _set_oob_score(self, X, y):

self.oob_score_ = oob_score / self.n_outputs_

def _validate_y(self, y):
y = np.copy(y)
def _validate_y_cw(self, y_org):
Contributor

Can you be explicit and call the variable y_original?

@glouppe
Contributor

glouppe commented Jan 8, 2015

Besides the small cosmit, I am +1 for merge. Tests are thorough. Could you also add an entry in the whatsnew file?

Thanks for your work @trevorstephens! This is a very helpful addition :)

(and sorry for the lack of responsiveness these days...)

@trevorstephens
Contributor Author

Thanks @glouppe! The requested changes are made. Second reviewer, anyone (while I figure out how to get around a lovely 11th-hour merge conflict)?

@amueller
Member

amueller commented Jan 9, 2015

I was not concerned about performance but code duplication.

@trevorstephens
Contributor Author

Ah. I made a few comments about this a couple of weeks back: #3961 (comment)

TL;DR: It may be possible, but it would result in identical calculations for every tree, whereas this way we only do it once.

@amueller
Member

amueller commented Jan 9, 2015

never mind then, 👍 from me.

@trevorstephens trevorstephens changed the title [MRG+1] Add class_weight support to the forests & trees [MRG+2] Add class_weight support to the forests & trees Jan 9, 2015
@trevorstephens
Contributor Author

Great, thanks!

Any way to kick off Travis again?

@GaelVaroquaux
Member

I restarted the failed job.

@trevorstephens
Contributor Author

@GaelVaroquaux thanks!

@trevorstephens
Contributor Author

Thanks for the reviews @amueller and @glouppe; looks like Travis is happy now. Are we good to merge?

@@ -211,11 +232,17 @@ def fit(self, X, y, sample_weight=None):

self.n_outputs_ = y.shape[1]

y = self._validate_y(y)
y, cw = self._validate_y_cw(y)
Member

Could we have a more explicit name for what "cw" means (for both the variable name and the function call)?

@trevorstephens
Contributor Author

@GaelVaroquaux, thanks for looking it over! Let me know what you think of the latest.

@trevorstephens
Contributor Author

@amueller @glouppe @GaelVaroquaux & others

I've been thinking about rolling class_weight out to the other ensembles in a future PR (so as not to undo the much-appreciated reviews so far) and have an API question. Forests use a bootstrap sample, but GradientBoostingClassifier uses sampling without replacement, while the BaggingClassifier allows both bootstrapping and sampling without replacement (AdaBoostClassifier doesn't implement any of these, so no issue there). Should the current class_weight='bootstrap' option for RandomForestClassifier be renamed to something less specific for consistency across ensemble classifiers? Maybe 'subsample', 'sample', 'estimator' or something else?

Sorry to bring this up after flipping from [WIP], but it seems important to decide on before merge given the other estimators in the module.

@GaelVaroquaux
Member

@trevorstephens: that's a very good comment. Maybe I would use 'subsample', which I find less confusing than 'estimator'.


    if getattr(y, "dtype", None) != DOUBLE or not y.flags.contiguous:
        y = np.ascontiguousarray(y, dtype=DOUBLE)

    if expanded_class_weight is not None:
        if sample_weight is not None:
            sample_weight = np.copy(sample_weight) * expanded_class_weight
Member

The 'np.copy' doesn't seem necessary above.

Conflicts:
	doc/whats_new.rst
	sklearn/ensemble/forest.py
	sklearn/tree/tree.py
@trevorstephens
Contributor Author

Thanks for the comments @GaelVaroquaux. Renamed the class_weight='bootstrap' option to 'subsample' and implemented the other changes you suggested. Let me know what you think.

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling 35c2535 on trevorstephens:rf-class_weight into 57544ae on scikit-learn:master.

@GaelVaroquaux
Member

LGTM.

Two 👍 and my review. I think that this can be merged.

Merging. Thanks!

GaelVaroquaux added a commit that referenced this pull request Jan 13, 2015
[MRG+2] Add class_weight support to the forests & trees
@GaelVaroquaux GaelVaroquaux merged commit 527ecf5 into scikit-learn:master Jan 13, 2015
@trevorstephens trevorstephens deleted the rf-class_weight branch January 13, 2015 06:42
@trevorstephens
Contributor Author

Thanks! Cheers!
