FIX always scale continuous features to unit variance in mutual info #24747
Conversation
```python
        Non-regression test for:
        https://github.com/scikit-learn/scikit-learn/issues/23720
        """
        rng = np.random.RandomState(0)
```
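For context, a hedged sketch of what a complete non-regression test along these lines could look like; the test name, data, and assertion below are illustrative guesses rather than the PR's actual diff:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def test_mutual_info_regression_scale_invariance():
    """Non-regression test for:
    https://github.com/scikit-learn/scikit-learn/issues/23720
    """
    rng = np.random.RandomState(0)
    X = rng.randn(100, 3)
    y = X[:, 0] + 0.1 * rng.randn(100)

    # With continuous features always scaled to unit variance internally,
    # rescaling the features should not change the estimated MI.
    mi_original = mutual_info_regression(X, y, random_state=0)
    mi_rescaled = mutual_info_regression(100 * X, y, random_state=0)
    np.testing.assert_allclose(mi_original, mi_rescaled)
```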
Should this use the global seed fixture (#22827)?
I am wondering if this is not overkill here.
¯\_(ツ)_/¯

Part of my question is to figure out how other people feel about when to use the global seed fixture and when not :)
IMO, this fixture is meant to test numerical stability issues triggered in corner cases.
Here, I would not necessarily have used it, because the two input vectors used to compute the MI are passed through exactly the same functions underneath.
betatim left a comment:
LGTM, just one comment about global seeds or not.
Co-authored-by: Tim Head <[email protected]>
Is it worth having a test that checks that the MI is symmetric also when the inputs are "correlated"? Something like:

```python
In [18]: d_c = c.astype(int)

In [21]: mutual_info_classif(
    ...:     c[:, None], d_c, discrete_features=[False], random_state=123
    ...: )
Out[21]: array([0.92509398])

In [22]: mutual_info_regression(
    ...:     d_c[:, None], c, discrete_features=[True], random_state=123
    ...: )
Out[22]: array([0.92509398])
```
Yep, we can add this test as well.
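A hedged sketch of how that additional test might be written as a pytest test; the name and sample size are illustrative, and the values mirror the REPL session above:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def test_mutual_info_symmetry_correlated():
    """MI(c, d_c) should be the same whichever variable is the target."""
    rng = np.random.RandomState(0)
    c = rng.randn(1000)
    d_c = c.astype(int)  # discrete variable correlated with c -> "high" MI

    mi_cd = mutual_info_classif(
        c[:, None], d_c, discrete_features=[False], random_state=123
    )
    mi_dc = mutual_info_regression(
        d_c[:, None], c, discrete_features=[True], random_state=123
    )
    np.testing.assert_allclose(mi_cd, mi_dc)
```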
Maybe with a "high" and a "low" MI test there is no need to test different seeds?
OK, I see what you want to test. Let's try the fixture then. Indeed, it uses all random seeds only when requested (I always forget about that).
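For illustration, the fixture pattern under discussion would look roughly like this; `global_random_seed` is the pytest fixture added in #22827, and the test body is elided here since it matches the sketch above:

```python
import numpy as np

def test_mutual_info_symmetry(global_random_seed):
    # The fixture injects a single seed per run by default; setting the
    # SKLEARN_TESTS_GLOBAL_RANDOM_SEED environment variable widens the
    # range of seeds exercised.
    rng = np.random.RandomState(global_random_seed)
    # ... same symmetry checks as in the sketch above ...
```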
thomasjpfan left a comment:
Thank you for the PR!
Co-authored-by: Thomas J. Fan <[email protected]>
jeremiedbb left a comment:
LGTM
closes #23720
The continuous features in `X` should be scaled to unit variance independently of whether `y` is continuous or discrete. This is an implementation detail reported in [1].

[1] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information". Phys. Rev. E 69, 2004.
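To make the fix concrete, here is a minimal sketch of the scaling step described above. The helper name is hypothetical and this is not scikit-learn's actual internal code, just the idea:

```python
import numpy as np

def scale_continuous_to_unit_variance(X, continuous_mask):
    """Hypothetical helper illustrating the fix: divide every continuous
    column of X by its standard deviation so the k-NN distances used by
    the Kraskov MI estimator do not depend on feature scale."""
    X = np.asarray(X, dtype=np.float64).copy()
    std = X[:, continuous_mask].std(axis=0)
    std[std == 0] = 1.0  # leave constant columns untouched
    X[:, continuous_mask] /= std
    return X
```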