[MRG+1] FIX Correct depth formula in iforest #8576
Would it be reasonable to add a non-regression test?
@jnothman the change is made in a private method of the IsolationForest class, and the tests in scikit-learn/sklearn/ensemble/tests/test_iforest.py are passing.
Yes, but they were passing when it was broken too, so we should add a test that checks for the correct behaviour.
Thanks for the test. Does it fail in |
A what's new entry and this is good to go...
It also needs to be added to the list of models with changed behaviour in what's new.
doc/whats_new.rst
Outdated
@@ -18,7 +18,9 @@ parameters, may produce different models from the previous version. This often
occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.

* *to be listed*
- Made a change to :class:`sklearn.ensemble.IsolationForest` by
I think you'll just have to list the class :class:`sklearn.ensemble.IsolationForest` and the user is expected to ctrl-f it out... Confirm with @jnothman though...
Yes, I think
* :class:`sklearn.ensemble.IsolationForest` (bug fix)
would more than suffice.
sklearn/ensemble/iforest.py
Outdated
@@ -300,7 +300,7 @@ def _average_path_length(n_samples_leaf):
         if n_samples_leaf <= 1:
             return 1.
         else:
-            return 2. * (np.log(n_samples_leaf) + 0.5772156649) - 2. * (
+            return 2. * (np.log(n_samples_leaf - 1.) + 0.5772156649) - 2. * (
Can we please use np.euler_gamma instead of 0.57721...
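For reference, the scalar branch under discussion can be sketched as below, with the `n - 1` correction and `np.euler_gamma` in place of the hard-coded constant. The function name `average_path_length_scalar` is illustrative only, and the trailing `(n - 1) / n` term is inferred from the expected values used in the tests later in this thread, since the diff line above is cut off.

```python
import numpy as np


def average_path_length_scalar(n_samples_leaf):
    """Sketch of the corrected scalar formula (illustrative name)."""
    if n_samples_leaf <= 1:
        return 1.0
    # 2 * (ln(n - 1) + gamma) - 2 * (n - 1) / n
    return (2.0 * (np.log(n_samples_leaf - 1.0) + np.euler_gamma)
            - 2.0 * (n_samples_leaf - 1.0) / n_samples_leaf)
```

The only change from the pre-fix version is the argument of the logarithm, `n - 1` instead of `n`.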
    # for average path length

    assert_almost_equal(_average_path_length(1), 1., decimal=10)
    assert_almost_equal(_average_path_length(5), 2.327020052, decimal=10)
I think I'd rather this written out as: 2*np.log(4) + 2 * np.euler_gamma - (2 * 4/5), unless you've got 2.327020052 straight from some reference table.
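A quick check that the literal in the diff and the written-out expression agree. This is only a sketch of the reviewer's suggestion; the variable name `expected_n5` is made up for illustration, and float literals are used so the division is not truncated under Python 2.

```python
import numpy as np

# Expected value for n = 5, written out from the formula instead of hard-coded.
expected_n5 = 2. * np.log(4.) + 2. * np.euler_gamma - (2. * 4. / 5.)

# The literal from the diff agrees with the expression to 10 decimal places.
np.testing.assert_almost_equal(expected_n5, 2.327020052, decimal=10)
```

Writing the expression out makes it obvious where the number comes from and keeps the test in sync with the formula.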
doc/whats_new.rst
Outdated
@@ -28,7 +28,7 @@ cannot assure that this list is complete.)
 Changelog
 ---------

-New features
+- New features
I think this should be reverted...
Yes, something weird has happened here.
sklearn/ensemble/iforest.py
Outdated
@@ -314,7 +314,7 @@ def _average_path_length(n_samples_leaf):

         average_path_length[mask] = 1.
         average_path_length[not_mask] = 2. * (
-            np.log(n_samples_leaf[not_mask]) + 0.5772156649) - 2. * (
+            np.log(n_samples_leaf[not_mask]) + np.euler_gamma) - 2. * (
BTW is it correct to not subtract 1 here? @ngoix
good catch! so much for our LGTMs... look a bit wider in future?
1 is subtracted in lines 303-304.
I think it should be done here too...
I'd like to see a separate test for average_path_length testing equivalence between the integer and array cases. Please add.
"look a bit wider in future?"
Indeed. Sorry for not being alert to that...
I.e. ensure _average_path_length(999) == _average_path_length(np.array([999]))
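The requested equivalence check could look roughly like this. It is a sketch under assumptions: `path_length_int` and `path_length_array` are illustrative stand-ins for the two branches of the private helper, not the code in `iforest.py`, and both already include the `n - 1` correction.

```python
import numpy as np


def path_length_int(n):
    # Stand-in for the integer branch.
    if n <= 1:
        return 1.0
    return 2.0 * (np.log(n - 1.0) + np.euler_gamma) - 2.0 * (n - 1.0) / n


def path_length_array(n):
    # Stand-in for the array branch, using a boolean mask for n > 1.
    n = np.asarray(n, dtype=float)
    out = np.ones(n.shape)
    not_mask = n > 1
    out[not_mask] = (2.0 * (np.log(n[not_mask] - 1.0) + np.euler_gamma)
                     - 2.0 * (n[not_mask] - 1.0) / n[not_mask])
    return out


def test_int_and_array_agree():
    # Both branches must return the same value for the same input.
    for n in (1, 5, 999):
        np.testing.assert_almost_equal(
            path_length_array(np.array([n]))[0], path_length_int(n),
            decimal=10)
```

A test like this fails as soon as the correction is applied to one branch but not the other, which is the situation spotted in this review.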
LGTM. Will merge once CI approves |
sklearn/ensemble/iforest.py
Outdated
@@ -300,7 +301,7 @@ def _average_path_length(n_samples_leaf):
         if n_samples_leaf <= 1:
             return 1.
         else:
-            return 2. * (np.log(n_samples_leaf) + 0.5772156649) - 2. * (
+            return 2. * (np.log(n_samples_leaf - 1.) + np.euler_gamma) - 2. * (
Now you should reference just euler_gamma, not np.
A small enhancement request... With that I'm done here... Thx
    assert_almost_equal(_average_path_length(999), result_two, decimal=10)


def test_average_path_length_arr_int():
Can you remove this test and add to the previous test?
assert_array_almost_equal(_...(np.array([1, 5, 999])), [1., result_one, result_two]), deci...)
Done.
;) I guess we don't want to do that and have users complain to numpy "only if I import sklearn, I get |
sklearn/utils/fixes.py
Outdated
@@ -36,6 +36,9 @@ def _parse_version(version_string):
            version.append(x)
    return tuple(version)


euler_gamma = getattr(np,
np and euler_gamma could be put on a single line, if you did this for the pep8 line limit?
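The diff line above is cut off after `getattr(np,`. A minimal sketch of this kind of compatibility shim, assuming the fallback is a plain float literal (the digits below are the standard value of the Euler-Mascheroni constant, not necessarily the ones used in the PR):

```python
import numpy as np

# Use NumPy's constant when it exists; otherwise fall back to a literal.
# The fallback value here is an assumption made for illustration.
euler_gamma = getattr(np, 'euler_gamma', 0.5772156649015329)
```

Defining the name in a local module, rather than assigning to `np.euler_gamma`, avoids changing NumPy's namespace for users who import the library.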
    assert_almost_equal(_average_path_length(1), 1., decimal=10)
    assert_almost_equal(_average_path_length(5), result_one, decimal=10)
    assert_almost_equal(_average_path_length(999), result_two, decimal=10)
    assert_almost_equal(_average_path_length(5),
This test is now redundant and can be removed
Thanks for the patience!

    result_one = 2. * (np.log(4.) + euler_gamma) - 2. * 4. / 5.
    result_two = 2. * (np.log(998.) + euler_gamma) - 2. * 998. / 999.
    assert_array_almost_equal(_average_path_length(np.array([1, 5, 999])),
Sorry if it was unclear, I meant to ask for the removal of the redundant _average_path_length(np.array([1])) == _average_path_length(1) line...
We still need to test the int arguments, as they are handled in a different line of code than array arguments...
i.e. Could you add back the assert_almost_equal(_average_path_length(1), result_one) and assert...(5), result_two) lines?
(The current tests would pass happily even if you revert the changes done at https://github.com/scikit-learn/scikit-learn/pull/8576/files#diff-522aed8770bec9fb385e859d53c63983R304)
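To illustrate the point, here is a sketch in which the integer branch still carries the old `log(n)` formula while the array branch is fixed. The function names are hypothetical stand-ins, not the code in the PR. An array-only assertion passes, while an assertion on the integer argument exposes the regression.

```python
import numpy as np


def reverted_int_branch(n):
    # Pre-fix scalar formula: log(n) instead of log(n - 1).
    if n <= 1:
        return 1.0
    return 2.0 * (np.log(n) + np.euler_gamma) - 2.0 * (n - 1.0) / n


def fixed_array_branch(n):
    # Corrected array formula with the n - 1 term.
    n = np.asarray(n, dtype=float)
    out = np.ones(n.shape)
    not_mask = n > 1
    out[not_mask] = (2.0 * (np.log(n[not_mask] - 1.0) + np.euler_gamma)
                     - 2.0 * (n[not_mask] - 1.0) / n[not_mask])
    return out


result_one = 2. * (np.log(4.) + np.euler_gamma) - 2. * 4. / 5.

# Array-only check: passes even though the integer branch is wrong.
np.testing.assert_array_almost_equal(
    fixed_array_branch(np.array([1, 5])), [1., result_one], decimal=10)

# Integer check: this is the assertion that catches the reverted branch.
int_branch_is_wrong = abs(reverted_int_branch(5) - result_one) > 1e-10
```

Because the two argument types go through separate lines of code, each needs its own assertion against the expected value.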
Thanks a lot @PtrWang! |
* Fixed depth formula in iforest
* Added non-regression test for issue scikit-learn#8549
* reverted some whitespace changes
* Made changes to what's new and whitespace changes
* Update whats_new.rst
* Update whats_new.rst
* fixed faulty whitespace
* faulty whitespace fix and change to whats new
* added constants to iforest average_path_length and the according non regression test
* COSMIT
* Update whats_new.rst
* Corrected IsolationForest average path formula and added integer array equiv test
* changed line to under 80 char
* Update whats_new.rst
* Update whats_new.rst
* reran tests
* redefine np.euler_gamma
* added import statement for euler_gammma in iforest and test_iforest
* changed np.euler_gamma to euler_gamma
* fix small formatting issue
* fix small formatting issue
* modified average_path_length tests
* formatting fix + removed redundant tests
* fix import error
* retry remote server error
* retry remote server error
* retry remote server error
* re-added some iforest tests
* re-added some iforest tests
Reference Issue
What does this implement/fix? Explain your changes.
Fixed issue #8549
Any other comments?