
Conversation

@glemaitre (Member):

closes #23720

The continuous features in X should be scaled to unit variance regardless of whether y is continuous or discrete. This implementation detail is reported in [1].

[1] A. Kraskov, H. Stögbauer, and P. Grassberger, “Estimating mutual information”. Phys. Rev. E 69, 2004.
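
For context, a minimal sketch of the detail in question, assuming a hypothetical helper that standardizes the continuous columns before the k-NN based MI estimate (the names are illustrative, not the actual scikit-learn internals):

    import numpy as np

    def scale_continuous_features(X, continuous_mask):
        # Standardize only the continuous columns to unit variance,
        # independently of whether the target y is continuous or discrete.
        # This mirrors the implementation detail from Kraskov et al. [1].
        X = np.array(X, dtype=np.float64, copy=True)
        stds = X[:, continuous_mask].std(axis=0)
        stds[stds == 0] = 1.0  # leave constant columns unchanged
        X[:, continuous_mask] /= stds
        return X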

glemaitre added this to the 1.2 milestone on Oct 24, 2022.
A review comment was left on the new test (quoted diff context):

    """Non-regression test for:
    https://github.com/scikit-learn/scikit-learn/issues/23720
    """
    rng = np.random.RandomState(0)
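
For readers without the full diff, a self-contained sketch of what such a non-regression test could look like (the actual test merged in the PR may differ in its details):

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

    def test_mutual_info_symmetry():
        """Non-regression test for:
        https://github.com/scikit-learn/scikit-learn/issues/23720
        """
        rng = np.random.RandomState(0)
        x = rng.randn(100)          # continuous feature
        y = rng.randint(0, 3, 100)  # discrete target
        mi_classif = mutual_info_classif(
            x[:, None], y, discrete_features=[False], random_state=0
        )
        mi_regression = mutual_info_regression(
            y[:, None], x, discrete_features=[True], random_state=0
        )
        # MI is symmetric, so both directions should agree.
        np.testing.assert_allclose(mi_classif, mi_regression)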
Member:

Should this use the global seed fixture (#22827)?

@glemaitre (Member Author), Oct 25, 2022:
I am wondering if this is not overkill here.

Member:

¯\_(ツ)_/¯

Part of my question is to figure out how other people feel about when to use the global seed fixture and when not :)

@glemaitre (Member Author):

IMO, this fixture exists to test numerical stability issues triggered in corner cases.
Here, I would not necessarily have used it, because the two input vectors used to compute the MI go through exactly the same functions underneath.

@betatim (Member) left a comment:

LGTM, just one comment about global seeds or not

@betatim (Member) commented Oct 25, 2022:

Is it worth having a test that checks that the MI is symmetric also when the inputs are "correlated"? Something like:

In [18]: d_c = c.astype(int)

In [21]: mutual_info_classif(
    ...:         c[:, None], d_c, discrete_features=[False], random_state=123
    ...:     )
Out[21]: array([0.92509398])

In [22]: mutual_info_regression(
    ...:         d_c[:, None], c, discrete_features=[True], random_state=123
    ...:     )
Out[22]: array([0.92509398])
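
The snippet above leaves c and d_c undefined. A self-contained version, assuming c is a continuous sample and d_c its integer discretization (so the two are strongly dependent; the exact MI value depends on the data):

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

    rng = np.random.RandomState(123)
    c = rng.uniform(0, 5, size=1000)  # continuous variable
    d_c = c.astype(int)               # discretized copy, highly informative about c

    mi_cd = mutual_info_classif(
        c[:, None], d_c, discrete_features=[False], random_state=123
    )
    mi_dc = mutual_info_regression(
        d_c[:, None], c, discrete_features=[True], random_state=123
    )
    np.testing.assert_allclose(mi_cd, mi_dc)  # symmetric after the fix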

@glemaitre (Member Author):
> Is it worth having a test that checks that the MI is symmetric also when the inputs are "correlated"?

Yep, we can add this test as well.

@betatim (Member) commented Oct 25, 2022:

Maybe with a "high" and "low" MI test there is no need to test different seeds?

@glemaitre (Member Author):

> Maybe with a "high" and "low" MI test there is no need to test different seeds?

OK, I see what you want to test. Let's try the fixture then. Indeed, it uses all random seeds only when requested (I always forget about that).
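
For reference, a sketch of the same test parametrized by the global_random_seed fixture (added in #22827); the fixture supplies a fixed seed by default and sweeps all seeds only when SKLEARN_TESTS_GLOBAL_RANDOM_SEED requests it. The body below is hypothetical, not the merged test:

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

    def test_mutual_info_symmetry(global_random_seed):
        # Same test as sketched earlier, but the seed now comes from the
        # fixture instead of a hard-coded RandomState(0), so CI can cover
        # many random draws on demand.
        rng = np.random.RandomState(global_random_seed)
        x = rng.randn(100)
        y = rng.randint(0, 3, 100)
        mi_1 = mutual_info_classif(
            x[:, None], y, discrete_features=[False], random_state=0
        )
        mi_2 = mutual_info_regression(
            y[:, None], x, discrete_features=[True], random_state=0
        )
        np.testing.assert_allclose(mi_1, mi_2)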

@thomasjpfan (Member) left a comment:

Thank you for the PR!

Co-authored-by: Thomas J. Fan <[email protected]>
@jeremiedbb (Member) left a comment:

LGTM

jeremiedbb merged commit 86301ac into scikit-learn:main on Nov 16, 2022.
Merging this pull request closes: Symmetry of mutual_info_classif and mutual_info_regression (#23720).
