[MRG] Proxy improvement methods added to entropy/gini #5233


Closed
wants to merge 1 commit into from

Conversation

jmschrei
Member

@jmschrei jmschrei commented Sep 9, 2015

This pull request implements explicit proxy improvement methods for the entropy and Gini criteria, to complement the one added in #5203. Entropy is changed by refactoring the equation

weighted_n_left / weighted_n_total * -sum( count_left_i / weighted_n_left * log( count_left_i / weighted_n_left ) ) + weighted_n_right / weighted_n_total * -sum( count_right_i / weighted_n_right * log( count_right_i / weighted_n_right ) )

into

sum( count_left_i * log( count_left_i ) ) + sum( count_right_i * log( count_right_i ) ) - weighted_n_left * log(weighted_n_left) - weighted_n_right * log(weighted_n_right)

which collapses some terms and caches the calculation involving weighted_n_left and weighted_n_right. Note that the result is a proxy to be maximized, rather than a weighted impurity to be minimized.
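The equivalence between the weighted child entropy and the collapsed proxy can be sanity-checked numerically. Here is a minimal Python sketch with made-up per-class counts; the variable names mirror the formulas above, not the actual Cython code in the PR:

```python
import math

# Hypothetical weighted per-class counts in the left and right children
count_left = [12.0, 7.0, 3.0]
count_right = [5.0, 9.0, 14.0]

weighted_n_left = sum(count_left)
weighted_n_right = sum(count_right)
weighted_n_total = weighted_n_left + weighted_n_right

def entropy(counts, total):
    # Standard entropy: -sum(p_i * log(p_i)), skipping empty classes
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

# Original objective: weighted average of child entropies (to be minimized)
original = (weighted_n_left / weighted_n_total * entropy(count_left, weighted_n_left)
            + weighted_n_right / weighted_n_total * entropy(count_right, weighted_n_right))

# Collapsed proxy (to be maximized): no per-class division by the child weight
proxy = (sum(c * math.log(c) for c in count_left if c > 0)
         + sum(c * math.log(c) for c in count_right if c > 0)
         - weighted_n_left * math.log(weighted_n_left)
         - weighted_n_right * math.log(weighted_n_right))

# Using sum(count_left_i) == weighted_n_left, algebra gives
# original == -proxy / weighted_n_total, so maximizing the proxy
# minimizes the weighted entropy
assert abs(original - (-proxy / weighted_n_total)) < 1e-12
```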

The Gini calculation is similarly collapsed into

sum( count_left_i ** 2.0 ) / weighted_n_left + sum( count_right_i ** 2.0 ) / weighted_n_right
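The Gini identity can be checked the same way. A hedged Python sketch with illustrative counts (again, not the Cython names from the PR):

```python
# Hypothetical weighted per-class counts in the left and right children
count_left = [12.0, 7.0, 3.0]
count_right = [5.0, 9.0, 14.0]

wl, wr = sum(count_left), sum(count_right)
wt = wl + wr

def gini(counts, total):
    # Standard Gini impurity: 1 - sum(p_i ** 2)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Original objective: weighted average of child impurities (to be minimized)
original = wl / wt * gini(count_left, wl) + wr / wt * gini(count_right, wr)

# Collapsed proxy (to be maximized)
proxy = sum(c ** 2 for c in count_left) / wl + sum(c ** 2 for c in count_right) / wr

# Algebra gives original == 1 - proxy / wt, so maximizing the proxy
# minimizes the weighted Gini impurity
assert abs(original - (1.0 - proxy / wt)) < 1e-12
```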

Here are timing tests (each row gives the dataset, score, and time in seconds). Entropy seems to take basically the same amount of time, whereas Gini seems to run slightly faster. I was unsure whether this was worth merging, so I figured I'd see what you had to say.

@glouppe @arjoly @ogrisel

ENTROPY
BRANCH
RandomForestClassifier
spambase   0.94 0.081
Gaussian   0.922 7.972
mnist      0.946 9.669
covtypes   0.948 19.574

ExtraTreesClassifier
spambase   0.943 0.047
Gaussian   0.924 0.846
mnist      0.952 4.678
covtypes   0.943 12.225

DecisionTreeClassifier
spambase   0.911 0.065
Gaussian   0.739 17.631
mnist      0.887 30.058
covtypes   0.948 11.025

MASTER
RandomForestClassifier
spambase   0.94 0.081
Gaussian   0.922 8.333
mnist      0.946 9.324
covtypes   0.948 17.701

ExtraTreesClassifier
spambase   0.943 0.047
Gaussian   0.924 0.79
mnist      0.952 4.542
covtypes   0.943 11.81

DecisionTreeClassifier
spambase   0.911 0.06
Gaussian   0.739 16.75
mnist      0.887 29.83
covtypes   0.948 10.561



GINI
BRANCH
RandomForestClassifier
Gaussian   0.914 3.931
spambase   0.941 0.045
mnist      0.948 5.016
covtypes   0.944 13.238

ExtraTreesClassifier
Gaussian   0.918 0.652
spambase   0.946 0.041
mnist      0.947 4.641
covtypes   0.942 11.275

DecisionTreeClassifier
Gaussian   0.727 8.971
spambase   0.902 0.056
mnist      0.873 21.287
covtypes   0.944 8.989

MASTER
RandomForestClassifier
Gaussian   0.914 4.513
spambase   0.939 0.052
mnist      0.949 5.27
covtypes   0.944 13.767

ExtraTreesClassifier
Gaussian   0.917 0.679
spambase   0.947 0.042
mnist      0.952 5.131
covtypes   0.94 11.591

DecisionTreeClassifier
Gaussian   0.741 10.02
spambase   0.898 0.052
mnist      0.872 24.078
covtypes   0.943 9.836


@arjoly
Member

arjoly commented Sep 9, 2015

Can you add benchmarks using covertype and mnist?
In the benchmark folder, there are two pre-made scripts where you specify the algorithm that you want to benchmark.


cdef double wl_log_wl = weighted_n_left * log(weighted_n_left)
cdef double wr_log_wr = weighted_n_right * log(weighted_n_right)
cdef double entropy = -wl_log_wl - wr_log_wr
Member

I think that the three previous lines could be combined without loss of clarity.

@arjoly
Member

arjoly commented Sep 9, 2015

Hm, Travis is not happy. I tried this briefly in #5220 and hit exactly the same error. I believe this is due to numerical instability and the approximation of large numbers in floating point.

@jmschrei jmschrei mentioned this pull request Sep 9, 2015
@arjoly
Member

arjoly commented Sep 10, 2015

I have checked whether we have the same numerical instabilities for regression, but that doesn't seem to be the case:

diff --git a/sklearn/ensemble/tests/test_forest.py b/sklearn/ensemble/tests/test_forest.py
index e12f52d..67b7e77 100644
--- a/sklearn/ensemble/tests/test_forest.py
+++ b/sklearn/ensemble/tests/test_forest.py
@@ -188,9 +188,9 @@ def test_probability():
 def check_importances(name, X, y):
     # Check variable importances.

-    ForestClassifier = FOREST_CLASSIFIERS[name]
+    ForestEstimator = FOREST_ESTIMATORS[name]
     for n_jobs in [1, 2]:
-        clf = ForestClassifier(n_estimators=10, n_jobs=n_jobs)
+        clf = ForestEstimator(n_estimators=10, n_jobs=n_jobs)
         clf.fit(X, y)
         importances = clf.feature_importances_
         n_important = np.sum(importances > 0.1)
@@ -204,12 +204,12 @@ def check_importances(name, X, y):
         sample_weight = np.ones(y.shape)
         sample_weight[y == 1] *= 100

-        clf = ForestClassifier(n_estimators=50, n_jobs=n_jobs, random_state=0)
+        clf = ForestEstimator(n_estimators=50, n_jobs=n_jobs, random_state=0)
         clf.fit(X, y, sample_weight=sample_weight)
         importances = clf.feature_importances_
         assert_true(np.all(importances >= 0.0))

-        clf = ForestClassifier(n_estimators=50, n_jobs=n_jobs, random_state=0)
+        clf = ForestEstimator(n_estimators=50, n_jobs=n_jobs, random_state=0)
         clf.fit(X, y, sample_weight=3 * sample_weight)
         importances_bis = clf.feature_importances_
         assert_almost_equal(importances, importances_bis)
@@ -221,7 +221,7 @@ def test_importances():
                                         n_repeated=0, shuffle=False,
                                         random_state=0)

-    for name in FOREST_CLASSIFIERS:
+    for name in list(FOREST_CLASSIFIERS) + list(FOREST_REGRESSORS):
         yield check_importances, name, X, y


PS: note that random_state is not set for the first estimator.

@glouppe
Contributor

glouppe commented Sep 10, 2015

I don't have much time today to dig into this, but it is very important to make sure that variable importances converge to their true theoretical values. (We should add a test for that; I'll do it.)

In addition, variable importances should be invariant with respect to scaling the sample weights, as check_importances verifies. Not sure how to mitigate that; maybe by rescaling the given sample weights before fitting? (e.g. by dividing by sample_weight.max())
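The rescaling suggestion could look something like this minimal sketch; `rescale_sample_weight` is a hypothetical helper, not code from the PR or from scikit-learn's fit path:

```python
import numpy as np

def rescale_sample_weight(sample_weight):
    # Hypothetical helper: divide by the maximum so the largest weight is 1.0.
    # This keeps the weighted counts in a numerically friendlier range
    # without changing the relative weighting of the samples.
    sample_weight = np.asarray(sample_weight, dtype=np.float64)
    return sample_weight / sample_weight.max()

sample_weight = np.ones(100)
sample_weight[:10] *= 100  # upweight part of the data, as in check_importances

rescaled = rescale_sample_weight(sample_weight)
assert rescaled.max() == 1.0
# Scaling all input weights by a constant yields the same rescaled weights,
# which is exactly the invariance check_importances expects
assert np.allclose(rescaled, rescale_sample_weight(3 * sample_weight))
```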

@jmschrei
Member Author

The more trees you pass in, the closer to the theoretical values you get, but you don't get them exactly.


for k in range(n_outputs):
gini_left = 0.0
gini_right = 0.0
Member

Strange, this should be removed, no?

Member Author

Ah, yes. It's weird that the multi-output tests passed with this.

@jmschrei
Copy link
Member Author

I am not getting a significant speed up on covtypes or mnist, and there are other errors, so I am going to close this PR.

@jmschrei jmschrei closed this Sep 10, 2015
@arjoly
Member

arjoly commented Sep 10, 2015

Thanks for your hard work!
