[MRG+1] ENH: Feature selection based on mutual information #5372

Closed
wants to merge 25 commits into from
Changes from all commits

Commits (25)
df08def
ENH: Implemented mutual_info function
Oct 23, 2013
54c0783
DOC: Documentation update related to mutual_info
Dec 12, 2015
0245fc8
MAINT: Use six.moves.zip in mutual_info
Dec 12, 2015
c1aea3f
MAINT: Renamed module mutual_info to mutual_info_
Dec 14, 2015
689ed0d
DOC: Example for mutual_information
nmayorov Jan 10, 2016
835102a
API: Split mutual_info into _regression and _classif
nmayorov Jan 12, 2016
8394c1b
MAINT: Add blank lines between parameters in mutual_info_.py
nmayorov Jan 15, 2016
ad2f5f5
MAINT: Add check_classification_targets to mutual_info_classif
nmayorov Jan 15, 2016
ffc4fe9
TST: Change tolerance checks in test_mutual_info.py
nmayorov Jan 15, 2016
824dda3
MAINT: Small changes to plot_f_test_vs_mi.py
nmayorov Jan 15, 2016
051d3a2
MAINT: Slightly improve logic of discrete-continuous MI estimation
nmayorov Jan 15, 2016
7869992
MAINT: Slightly improve copy logic in _estimate_mi
nmayorov Jan 15, 2016
ec17289
DOC: Add short descriptions of methods for mutual info estimation
nmayorov Jan 15, 2016
b0491be
DOC: Add a short explanation of F-test vs MI in narrative doc
nmayorov Jan 16, 2016
d3a497a
BUG: Fix copy logic for mutual info functions
nmayorov Jan 16, 2016
375b070
TST: Speed up 2 tests related to mutual info
nmayorov Jan 17, 2016
d60636a
DOC: Small fixes in mutual_info_.py documentation
nmayorov Jan 17, 2016
094a077
MAINT: Small refactoring in mutual_info_.py
nmayorov Jan 17, 2016
5b3f515
MAINT: Get rid of classes in test_mutual_info.py
nmayorov Jan 17, 2016
b48a108
DOC: Add one more reference for mutual info methods
nmayorov Jan 20, 2016
e1bc056
MAINT: Add a clarification comment in mutual_info_.py
nmayorov Jan 20, 2016
a36edf2
DOC: Modify SelectKBest and SelectPercentile docstrings slightly
nmayorov Jan 20, 2016
e716c64
MAINT: Mention mutual info methods in whats_new.rst
nmayorov Jan 20, 2016
daa73c7
BUG: Remove non-ASCII symbols from mutual_info_.py
nmayorov Jan 20, 2016
4cc82a3
MAINT: Modify whats_new item related to mutual information
nmayorov Jan 21, 2016
2 changes: 2 additions & 0 deletions doc/modules/classes.rst
@@ -534,6 +534,8 @@ From text
feature_selection.chi2
feature_selection.f_classif
feature_selection.f_regression
feature_selection.mutual_info_classif
feature_selection.mutual_info_regression


.. _gaussian_process_ref:
27 changes: 18 additions & 9 deletions doc/modules/feature_selection.rst
@@ -67,8 +67,8 @@ as objects that implement the ``transform`` method:
:class:`SelectFdr`, or family wise error :class:`SelectFwe`.

* :class:`GenericUnivariateSelect` allows to perform univariate feature
selection with a configurable strategy. This allows to select the best
univariate selection strategy with hyper-parameter search estimator.
selection with a configurable strategy. This allows to select the best
univariate selection strategy with hyper-parameter search estimator.

For instance, we can perform a :math:`\chi^2` test to the samples
to retrieve only the two best features as follows:
@@ -84,17 +84,24 @@ to retrieve only the two best features as follows:
>>> X_new.shape
(150, 2)

These objects take as input a scoring function that returns
univariate p-values:
These objects take as input a scoring function that returns univariate scores
and p-values (or only scores for :class:`SelectKBest` and
:class:`SelectPercentile`):

* For regression: :func:`f_regression`
* For regression: :func:`f_regression`, :func:`mutual_info_regression`

* For classification: :func:`chi2` or :func:`f_classif`
* For classification: :func:`chi2`, :func:`f_classif`, :func:`mutual_info_classif`

The methods based on F-test estimate the degree of linear dependency between
two random variables. On the other hand, mutual information methods can capture
any kind of statistical dependency, but being nonparametric, they require more
samples for accurate estimation.

.. topic:: Feature selection with sparse data

If you use sparse data (i.e. data represented as sparse matrices),
only :func:`chi2` will deal with the data without making it dense.
:func:`chi2`, :func:`mutual_info_regression`, :func:`mutual_info_classif`
will deal with the data without making it dense.

.. warning::

@@ -103,7 +110,9 @@ univariate p-values:

.. topic:: Examples:

:ref:`example_feature_selection_plot_feature_selection.py`
* :ref:`example_feature_selection_plot_feature_selection.py`

* :ref:`example_feature_selection_plot_f_test_vs_mi.py`

.. _rfe:

@@ -315,4 +324,4 @@ Then, a :class:`sklearn.ensemble.RandomForestClassifier` is trained on the
transformed output, i.e. using only relevant features. You can perform
similar operations with the other feature selection methods and also
classifiers that provide a way to evaluate feature importances of course.
See the :class:`sklearn.pipeline.Pipeline` examples for more details.
See the :class:`sklearn.pipeline.Pipeline` examples for more details.
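A minimal sketch (not part of this PR's diff) of how the new score functions plug into the selectors described above, mirroring the chi2 example from the narrative documentation; the iris dataset and k=2 are illustrative assumptions, not taken from the diff:

# Sketch only: mutual_info_classif used as the score function of SelectKBest,
# analogous to the chi2 example in doc/modules/feature_selection.rst.
# The iris dataset and k=2 are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

iris = load_iris()
X, y = iris.data, iris.target

X_new = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y)
print(X_new.shape)  # expected (150, 2): two features retained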
10 changes: 10 additions & 0 deletions doc/whats_new.rst
@@ -15,6 +15,13 @@ Changelog
New features
............

- Added two functions for mutual information estimation:
:func:`feature_selection.mutual_info_classif` and
:func:`feature_selection.mutual_info_regression`. These functions can be
used in :class:`feature_selection.SelectKBest` and
:class:`feature_selection.SelectPercentile`, which now accept callables
returning only `scores`. By `Andrea Bravi`_ and `Nikolay Mayorov`_.

- The Gaussian Process module has been reimplemented and now offers classification
and regression estimators through :class:`gaussian_process.GaussianProcessClassifier`
and :class:`gaussian_process.GaussianProcessRegressor`. Among other things, the new
@@ -4037,3 +4044,6 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson.
.. _Imaculate: https://github.com/Imaculate

.. _Bernardo Stein: https://github.com/DanielSidhion

.. _Andrea Bravi: https://github.com/AndreaBravi

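To make the whats_new item above concrete, here is a minimal sketch (not from the PR) of SelectKBest used with a score function that returns only scores and no p-values; the toy data and k=1 are illustrative assumptions:

# Sketch only: SelectKBest with a callable returning only scores
# (mutual_info_regression). The synthetic data is illustrative.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X[:, 0] + 0.1 * rng.randn(200)  # only the first feature is informative

selector = SelectKBest(score_func=mutual_info_regression, k=1)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)   # expected (200, 1)
print(selector.scores_)   # one mutual information estimate per feature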
49 changes: 49 additions & 0 deletions examples/feature_selection/plot_f_test_vs_mi.py
@@ -0,0 +1,49 @@
"""
===========================================
Comparison of F-test and mutual information
===========================================

This example illustrates the differences between univariate F-test statistics
and mutual information.

We consider 3 features x_1, x_2, x_3 distributed uniformly over [0, 1]; the
target depends on them as follows:

y = x_1 + sin(6 * pi * x_2) + 0.1 * N(0, 1), that is, the third feature is completely irrelevant.

The code below plots the dependency of y against individual x_i and the
normalized values of the univariate F-test statistics and mutual information.

As F-test captures only linear dependency, it rates x_1 as the most
Member

I think we should also add this to the narrative doc, to help users know when to use what

Contributor Author

Honestly, I think someone needs to write a section explaining what F-tests, chi2 (and mutual info) are, when they are applicable, and how they differ. But I suggest delegating that to another PR.


Sounds great, thank you very much, but what about an overfitting example, do
you have one?


Member

Indeed, but just a line stating the same thing in the user section, i.e. that MI also captures non-linear dependence, won't hurt for now.

Member

+1 and maybe add an issue to track the larger doc problem

discriminative feature. On the other hand, mutual information can capture any
kind of dependency between variables and it rates x_2 as the most
discriminative feature, which probably agrees better with our intuitive
perception for this example. Both methods correctly mark x_3 as irrelevant.
Member

Nice example, thanks :-)

Member

why is this more intuitive? because the variance of p(y | x2) is smaller?

"""
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import f_regression, mutual_info_regression

np.random.seed(0)
X = np.random.rand(1000, 3)
Member

you should fix the random state though

Contributor Author

Why is that necessary? This example gives very similar results for any sample. I think it's a rather good thing, when an example is robust in this sense. Don't you agree?

Member

Yes, but it is sometimes weird when, while rebuilding the documentation, you find that the plot has changed.

y = X[:, 0] + np.sin(6 * np.pi * X[:, 1]) + 0.1 * np.random.randn(1000)

f_test, _ = f_regression(X, y)
f_test /= np.max(f_test)

mi = mutual_info_regression(X, y)
mi /= np.max(mi)

plt.figure(figsize=(15, 5))
for i in range(3):
    plt.subplot(1, 3, i + 1)
    plt.scatter(X[:, i], y)
    plt.xlabel("$x_{}$".format(i + 1), fontsize=14)
    if i == 0:
        plt.ylabel("$y$", fontsize=14)
    plt.title("F-test={:.2f}, MI={:.2f}".format(f_test[i], mi[i]),
              fontsize=16)
plt.show()

2 changes: 1 addition & 1 deletion examples/feature_selection/plot_rfe_digits.py
@@ -33,4 +33,4 @@
plt.matshow(ranking, cmap=plt.cm.Blues)
plt.colorbar()
plt.title("Ranking of pixels with RFE")
plt.show()
plt.show()
Member

newline

7 changes: 6 additions & 1 deletion sklearn/feature_selection/__init__.py
@@ -22,17 +22,22 @@

from .from_model import SelectFromModel

from .mutual_info_ import mutual_info_regression, mutual_info_classif


__all__ = ['GenericUnivariateSelect',
'RFE',
'RFECV',
'SelectFdr',
'SelectFpr',
'SelectFwe',
'SelectKBest',
'SelectFromModel',
'SelectPercentile',
'VarianceThreshold',
'chi2',
Member

why did you remove this?

Contributor Author

Looks like I moved it after SelectKBest. I probably wanted all the functions to be at the end.

'f_classif',
'f_oneway',
'f_regression',
'SelectFromModel']
'mutual_info_classif',
'mutual_info_regression']
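For context (not part of the diff), a minimal sketch of calling the newly exported functions directly; the synthetic data below is an illustrative assumption:

# Sketch only: calling the newly exported estimators directly. Each call
# returns one non-negative mutual information estimate per feature.
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

rng = np.random.RandomState(0)
X = rng.rand(500, 3)
y_class = (X[:, 0] > 0.5).astype(int)                        # depends on feature 0 only
y_reg = np.sin(2 * np.pi * X[:, 1]) + 0.1 * rng.randn(500)   # depends on feature 1 only

print(mutual_info_classif(X, y_class))    # largest estimate expected for feature 0
print(mutual_info_regression(X, y_reg))   # largest estimate expected for feature 1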