[MRG+1] ENH: Feature selection based on mutual information #5372
@@ -0,0 +1,49 @@
"""
===========================================
Comparison of F-test and mutual information
===========================================

This example illustrates the differences between univariate F-test statistics
and mutual information.

We consider 3 features x_1, x_2, x_3 distributed uniformly over [0, 1]; the
target depends on them as follows:

y = x_1 + sin(6 * pi * x_2) + 0.1 * N(0, 1), that is, the third feature is
completely irrelevant.

The code below plots the dependency of y against individual x_i and the
normalized values of univariate F-test statistics and mutual information.

As F-test captures only linear dependency, it rates x_1 as the most
discriminative feature. On the other hand, mutual information can capture any
kind of dependency between variables and it rates x_2 as the most
discriminative feature, which probably agrees better with our intuitive
perception for this example. Both methods correctly mark x_3 as irrelevant.
Review comment: Nice example, thanks :-)

Review comment: why is this more intuitive? because the variance of
"""
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import f_regression, mutual_info_regression

np.random.seed(0)
X = np.random.rand(1000, 3)
Review comment: you should fix the random state though

Review comment: Why is that necessary? This example gives very similar results for any sample. I think it's rather a good thing when an example is robust in this sense. Don't you agree?

Review comment: Yes, but it is sometimes weird when, while rebuilding the documentation, you find the plot has changed.
y = X[:, 0] + np.sin(6 * np.pi * X[:, 1]) + 0.1 * np.random.randn(1000)

f_test, _ = f_regression(X, y)
f_test /= np.max(f_test)

mi = mutual_info_regression(X, y)
mi /= np.max(mi)

plt.figure(figsize=(15, 5))
for i in range(3):
    plt.subplot(1, 3, i + 1)
    plt.scatter(X[:, i], y)
    plt.xlabel("$x_{}$".format(i + 1), fontsize=14)
    if i == 0:
        plt.ylabel("$y$", fontsize=14)
    plt.title("F-test={:.2f}, MI={:.2f}".format(f_test[i], mi[i]),
              fontsize=16)
plt.show()
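As a usage sketch (not part of the PR diff): the scorer added here plugs directly into the existing `SelectKBest` transformer for actual feature selection. The data construction mirrors the example above, with a fixed `RandomState` as one reviewer suggests; `k=2` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.RandomState(0)
X = rng.rand(1000, 3)
y = X[:, 0] + np.sin(6 * np.pi * X[:, 1]) + 0.1 * rng.randn(1000)

# Keep the two features with the highest estimated mutual information;
# the irrelevant x_3 is expected to be dropped.
selector = SelectKBest(score_func=mutual_info_regression, k=2)
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)         # (1000, 2)
print(selector.get_support())  # boolean mask over the three features
```

`get_support()` exposes which columns survived, which is often more useful than the transformed array when feature names need to be reported.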
@@ -33,4 +33,4 @@
 plt.matshow(ranking, cmap=plt.cm.Blues)
 plt.colorbar()
 plt.title("Ranking of pixels with RFE")
-plt.show()
+plt.show()
Review comment: newline
@@ -22,17 +22,22 @@
from .from_model import SelectFromModel
from .mutual_info_ import mutual_info_regression, mutual_info_classif


__all__ = ['GenericUnivariateSelect',
           'RFE',
           'RFECV',
           'SelectFdr',
           'SelectFpr',
           'SelectFwe',
           'SelectKBest',
           'SelectFromModel',
           'SelectPercentile',
           'VarianceThreshold',
           'chi2',
Review comment: why did you remove this?

Review comment: Looks like I moved it after
           'f_classif',
           'f_oneway',
           'f_regression',
-          'SelectFromModel']
+          'mutual_info_classif',
+          'mutual_info_regression']
Review comment: I think we should also add this to the narrative doc, to help users know when to use what.
Review comment: Honestly, I think someone needs to write a section explaining what F-tests, chi2 (and mutual information) are, when they are applicable, and how they differ. But I suggest delegating that to another PR.
Review comment: Sounds great, thank you very much, but this is about the overfitting example, do you have one?
On Fri, Jan 15, 2016 at 10:48 AM, Nikolay Mayorov [email protected] wrote:
Review comment: Indeed, but just a line stating the same thing in the user guide section, i.e. that MI also captures non-linear dependence, won't hurt for now.
Review comment: +1, and maybe add an issue to track the larger doc problem.
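To illustrate the point the reviewers are making (a toy sketch of mine, not part of this PR; the data construction and threshold are arbitrary): when a target depends on a feature only through a non-monotonic transform, the ANOVA F-test score stays near zero while mutual information picks the dependence up.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.RandomState(0)
X = rng.rand(500, 2)
# The class depends on x_1 only through |x_1 - 0.5|: the per-class means of
# x_1 coincide, so the F-test sees almost nothing, while mutual information
# captures the (deterministic) dependence. x_2 is pure noise.
y = (np.abs(X[:, 0] - 0.5) > 0.25).astype(int)

f_scores, _ = f_classif(X, y)
mi_scores = mutual_info_classif(X, y, random_state=0)

print(np.round(f_scores, 3))
print(np.round(mi_scores, 3))  # mutual information clearly favors x_1
```

A sentence along these lines, with such a contrast, could anchor the proposed narrative-doc section on when each scorer is appropriate.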