[MRG] Proxy improvement methods added to entropy/gini #5233


Closed
wants to merge 1 commit into from

Conversation

jmschrei
Member

@jmschrei jmschrei commented Sep 9, 2015

This pull request implements explicit proxy improvement methods for the entropy and Gini criteria, to complement the one added in #5203. Entropy is changed by refactoring the equation

weighted_n_left / weighted_n_total * -sum( count_left_i / weighted_n_left * log( count_left_i / weighted_n_left ) ) + weighted_n_right / weighted_n_total * -sum( count_right_i / weighted_n_right * log( count_right_i / weighted_n_right ) )

into

sum( count_left_i * log( count_left_i ) ) + sum( count_right_i * log( count_right_i ) ) - weighted_n_left * log(weighted_n_left) - weighted_n_right * log(weighted_n_right)

which collapses some terms and caches the calculation involving weighted_n_left and weighted_n_right. Note that the result is a proxy to be maximized, rather than a weighted impurity to be minimized.
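The equivalence between the weighted child entropy and the collapsed proxy can be sanity-checked numerically. Here is a minimal Python sketch with made-up per-class counts; the variable names mirror the formulas above, not the actual Cython code in the PR:

```python
import math

# Hypothetical weighted per-class counts in the left and right children
count_left = [12.0, 7.0, 3.0]
count_right = [5.0, 9.0, 14.0]

weighted_n_left = sum(count_left)
weighted_n_right = sum(count_right)
weighted_n_total = weighted_n_left + weighted_n_right

def entropy(counts, total):
    # Standard entropy: -sum(p_i * log(p_i)), skipping empty classes
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

# Original objective: weighted average of child entropies (to be minimized)
original = (weighted_n_left / weighted_n_total * entropy(count_left, weighted_n_left)
            + weighted_n_right / weighted_n_total * entropy(count_right, weighted_n_right))

# Collapsed proxy (to be maximized): no per-class division by the child weight
proxy = (sum(c * math.log(c) for c in count_left if c > 0)
         + sum(c * math.log(c) for c in count_right if c > 0)
         - weighted_n_left * math.log(weighted_n_left)
         - weighted_n_right * math.log(weighted_n_right))

# Using sum(count_left_i) == weighted_n_left, algebra gives
# original == -proxy / weighted_n_total, so maximizing the proxy
# minimizes the weighted entropy
assert abs(original - (-proxy / weighted_n_total)) < 1e-12
```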

The Gini calculation is similarly collapsed into

sum( count_left_i ** 2.0 ) / weighted_n_left + sum( count_right_i ** 2.0 ) / weighted_n_right
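The Gini identity can be checked the same way. A hedged Python sketch with illustrative counts (again, not the Cython names from the PR):

```python
# Hypothetical weighted per-class counts in the left and right children
count_left = [12.0, 7.0, 3.0]
count_right = [5.0, 9.0, 14.0]

wl, wr = sum(count_left), sum(count_right)
wt = wl + wr

def gini(counts, total):
    # Standard Gini impurity: 1 - sum(p_i ** 2)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Original objective: weighted average of child impurities (to be minimized)
original = wl / wt * gini(count_left, wl) + wr / wt * gini(count_right, wr)

# Collapsed proxy (to be maximized)
proxy = sum(c ** 2 for c in count_left) / wl + sum(c ** 2 for c in count_right) / wr

# Algebra gives original == 1 - proxy / wt, so maximizing the proxy
# minimizes the weighted Gini impurity
assert abs(original - (1.0 - proxy / wt)) < 1e-12
```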

Here are timing tests (each row gives the dataset, score, and time in seconds). Entropy seems to take basically the same amount of time, whereas Gini seems to run slightly faster. I was unsure whether this was worth merging, so I figured I'd see what you had to say.

@glouppe @arjoly @ogrisel

ENTROPY
BRANCH
RandomForestClassifier
spambase   0.94 0.081
Gaussian   0.922 7.972
mnist      0.946 9.669
covtypes   0.948 19.574

ExtraTreesClassifier
spambase   0.943 0.047
Gaussian   0.924 0.846
mnist      0.952 4.678
covtypes   0.943 12.225

DecisionTreeClassifier
spambase   0.911 0.065
Gaussian   0.739 17.631
mnist      0.887 30.058
covtypes   0.948 11.025

MASTER
RandomForestClassifier
spambase   0.94 0.081
Gaussian   0.922 8.333
mnist      0.946 9.324
covtypes   0.948 17.701

ExtraTreesClassifier
spambase   0.943 0.047
Gaussian   0.924 0.79
mnist      0.952 4.542
covtypes   0.943 11.81

DecisionTreeClassifier
spambase   0.911 0.06
Gaussian   0.739 16.75
mnist      0.887 29.83
covtypes   0.948 10.561



GINI
BRANCH
RandomForestClassifier
Gaussian   0.914 3.931
spambase   0.941 0.045
mnist      0.948 5.016
covtypes   0.944 13.238

ExtraTreesClassifier
Gaussian   0.918 0.652
spambase   0.946 0.041
mnist      0.947 4.641
covtypes   0.942 11.275

DecisionTreeClassifier
Gaussian   0.727 8.971
spambase   0.902 0.056
mnist      0.873 21.287
covtypes   0.944 8.989

MASTER
RandomForestClassifier
Gaussian   0.914 4.513
spambase   0.939 0.052
mnist      0.949 5.27
covtypes   0.944 13.767

ExtraTreesClassifier
Gaussian   0.917 0.679
spambase   0.947 0.042
mnist      0.952 5.131
covtypes   0.94 11.591

DecisionTreeClassifier
Gaussian   0.741 10.02
spambase   0.898 0.052
mnist      0.872 24.078
covtypes   0.943 9.836


@arjoly
Member

arjoly commented Sep 9, 2015

Can you add benchmarks using covertype and mnist?
In the benchmark folder, there are two pre-made scripts where you specify the algorithm that you want to benchmark.


cdef double wl_log_wl = weighted_n_left * log(weighted_n_left)
cdef double wr_log_wr = weighted_n_right * log(weighted_n_right)
cdef double entropy = -wl_log_wl - wr_log_wr
Member

I think that the three previous lines could be combined without loss of clarity.

@arjoly
Member

arjoly commented Sep 9, 2015

Hm, Travis is not happy. I tried this briefly in #5220 and hit exactly the same error. I believe this is due to numerical instability and the approximation of large numbers in floating point.

@jmschrei jmschrei mentioned this pull request Sep 9, 2015
@arjoly
Member

arjoly commented Sep 10, 2015

I have checked whether we have the same numerical instabilities for regression, but that doesn't seem to be the case:

diff --git a/sklearn/ensemble/tests/test_forest.py b/sklearn/ensemble/tests/test_forest.py
index e12f52d..67b7e77 100644
--- a/sklearn/ensemble/tests/test_forest.py
+++ b/sklearn/ensemble/tests/test_forest.py
@@ -188,9 +188,9 @@ def test_probability():
 def check_importances(name, X, y):
     # Check variable importances.

-    ForestClassifier = FOREST_CLASSIFIERS[name]
+    ForestEstimator = FOREST_ESTIMATORS[name]
     for n_jobs in [1, 2]:
-        clf = ForestClassifier(n_estimators=10, n_jobs=n_jobs)
+        clf = ForestEstimator(n_estimators=10, n_jobs=n_jobs)
         clf.fit(X, y)
         importances = clf.feature_importances_
         n_important = np.sum(importances > 0.1)
@@ -204,12 +204,12 @@ def check_importances(name, X, y):
         sample_weight = np.ones(y.shape)
         sample_weight[y == 1] *= 100

-        clf = ForestClassifier(n_estimators=50, n_jobs=n_jobs, random_state=0)
+        clf = ForestEstimator(n_estimators=50, n_jobs=n_jobs, random_state=0)
         clf.fit(X, y, sample_weight=sample_weight)
         importances = clf.feature_importances_
         assert_true(np.all(importances >= 0.0))

-        clf = ForestClassifier(n_estimators=50, n_jobs=n_jobs, random_state=0)
+        clf = ForestEstimator(n_estimators=50, n_jobs=n_jobs, random_state=0)
         clf.fit(X, y, sample_weight=3 * sample_weight)
         importances_bis = clf.feature_importances_
         assert_almost_equal(importances, importances_bis)
@@ -221,7 +221,7 @@ def test_importances():
                                         n_repeated=0, shuffle=False,
                                         random_state=0)

-    for name in FOREST_CLASSIFIERS:
+    for name in list(FOREST_CLASSIFIERS) + list(FOREST_REGRESSORS):
         yield check_importances, name, X, y


PS: note that random_state is not set for the first estimator.

@glouppe
Contributor

glouppe commented Sep 10, 2015

I don't have much time today to dig into this, but it is very important to make sure that variable importances converge to their true theoretical values. (We should add a test for that; I'll do it.)

In addition, variable importances should be invariant with respect to scaling the sample weights, as check_importances verifies. Not sure how to mitigate that; maybe by rescaling the given sample weights before fitting? (e.g. by dividing by sample_weight.max())
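The rescaling suggestion could look something like this minimal sketch; `rescale_sample_weight` is a hypothetical helper, not code from the PR or from scikit-learn's fit path:

```python
import numpy as np

def rescale_sample_weight(sample_weight):
    # Hypothetical helper: divide by the maximum so the largest weight is 1.0.
    # This keeps the weighted counts in a numerically friendlier range
    # without changing the relative weighting of the samples.
    sample_weight = np.asarray(sample_weight, dtype=np.float64)
    return sample_weight / sample_weight.max()

sample_weight = np.ones(100)
sample_weight[:10] *= 100  # upweight part of the data, as in check_importances

rescaled = rescale_sample_weight(sample_weight)
assert rescaled.max() == 1.0
# Scaling all input weights by a constant yields the same rescaled weights,
# which is exactly the invariance check_importances expects
assert np.allclose(rescaled, rescale_sample_weight(3 * sample_weight))
```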

@jmschrei
Member Author

The more trees you pass in, the closer to the theoretical values you get, but you don't get them exactly.


for k in range(n_outputs):
gini_left = 0.0
gini_right = 0.0
Member

Strange, this should be removed, no?

Member Author

Ah, yes. It's weird that the multi-output tests passed with this.

@jmschrei
Copy link
Member Author

I am not getting a significant speed up on covtypes or mnist, and there are other errors, so I am going to close this PR.

@jmschrei jmschrei closed this Sep 10, 2015
@arjoly
Member

arjoly commented Sep 10, 2015

Thanks for your hard work!
