FIX always scale continuous features to unit variance in mutual info #24747
Conversation
```python
        Non-regression test for:
        https://github.com/scikit-learn/scikit-learn/issues/23720
        """
        rng = np.random.RandomState(0)
```
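For context, a hedged sketch of what a complete non-regression test along these lines could look like; the test name, data, and assertion below are illustrative guesses rather than the PR's actual diff:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def test_mutual_info_regression_scale_invariance():
    """Non-regression test for:
    https://github.com/scikit-learn/scikit-learn/issues/23720
    """
    rng = np.random.RandomState(0)
    X = rng.randn(100, 3)
    y = X[:, 0] + 0.1 * rng.randn(100)

    # With continuous features always scaled to unit variance internally,
    # rescaling the features should not change the estimated MI.
    mi_original = mutual_info_regression(X, y, random_state=0)
    mi_rescaled = mutual_info_regression(100 * X, y, random_state=0)
    np.testing.assert_allclose(mi_original, mi_rescaled)
```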
Should this use the global seed fixture (#22827)?
I am wondering if this is not overkill here.
¯\_(ツ)_/¯

Part of my question is to figure out how other people feel about when to use the global seed fixture and when not :)
IMO, this fixture is meant to test numerical stability issues triggered in corner cases.
Here, I would not necessarily have used it, because the two input vectors used to compute the MI are passed through exactly the same functions underneath.
betatim left a comment:
LGTM, just one comment about global seeds or not.
Co-authored-by: Tim Head <[email protected]>
Is it worth having a test that checks that the MI is symmetric also when the inputs are "correlated"? Something like:

```python
In [18]: d_c = c.astype(int)

In [21]: mutual_info_classif(
    ...:     c[:, None], d_c, discrete_features=[False], random_state=123
    ...: )
Out[21]: array([0.92509398])

In [22]: mutual_info_regression(
    ...:     d_c[:, None], c, discrete_features=[True], random_state=123
    ...: )
Out[22]: array([0.92509398])
```
Yep, we can add this test as well.
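A hedged sketch of how that additional test might be written as a pytest test; the name and sample size are illustrative, and the values mirror the REPL session above:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def test_mutual_info_symmetry_correlated():
    """MI(c, d_c) should be the same whichever variable is the target."""
    rng = np.random.RandomState(0)
    c = rng.randn(1000)
    d_c = c.astype(int)  # discrete variable correlated with c -> "high" MI

    mi_cd = mutual_info_classif(
        c[:, None], d_c, discrete_features=[False], random_state=123
    )
    mi_dc = mutual_info_regression(
        d_c[:, None], c, discrete_features=[True], random_state=123
    )
    np.testing.assert_allclose(mi_cd, mi_dc)
```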
Maybe with a "high" and a "low" MI test there is no need to test different seeds?
OK, I see what you want to test. Let's try the fixture then. Indeed, it uses all random seeds only when requested (I always forget about that).
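For illustration, the fixture pattern under discussion would look roughly like this; `global_random_seed` is the pytest fixture added in #22827, and the test body is elided here since it matches the sketch above:

```python
import numpy as np

def test_mutual_info_symmetry(global_random_seed):
    # The fixture injects a single seed per run by default; setting the
    # SKLEARN_TESTS_GLOBAL_RANDOM_SEED environment variable widens the
    # range of seeds exercised.
    rng = np.random.RandomState(global_random_seed)
    # ... same symmetry checks as in the sketch above ...
```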
thomasjpfan left a comment:
Thank you for the PR!
Co-authored-by: Thomas J. Fan <[email protected]>
jeremiedbb left a comment:
LGTM
closes #23720
The continuous features in `X` should be scaled to unit variance independently of whether `y` is continuous or discrete. This is an implementation detail reported in [1].

[1] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information". Phys. Rev. E 69, 2004.
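To make the fix concrete, here is a minimal sketch of the scaling step described above. The helper name is hypothetical and this is not scikit-learn's actual internal code, just the idea:

```python
import numpy as np

def scale_continuous_to_unit_variance(X, continuous_mask):
    """Hypothetical helper illustrating the fix: divide every continuous
    column of X by its standard deviation so the k-NN distances used by
    the Kraskov MI estimator do not depend on feature scale."""
    X = np.asarray(X, dtype=np.float64).copy()
    std = X[:, continuous_mask].std(axis=0)
    std[std == 0] = 1.0  # leave constant columns untouched
    X[:, continuous_mask] /= std
    return X
```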