
[MRG] Add check for n_components in pca #10042


Merged
merged 6 commits into scikit-learn:master on Nov 2, 2017

Conversation

CoderPat
Contributor

Reference Issues/PRs

Fixes #10034

What does this implement/fix? Explain your changes.

Previously, passing an n_components bigger than 1 that wasn't an integer would produce confusing errors. I fixed this by adding an explicit check and raising a ValueError for it (+ tests, of course).

Any other comments?

I opted for doing it in the _fit method, but it could also have been done in _fit_full or _fit_truncated, where other checks are done. I can change it if reviewers prefer.
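
Roughly, the idea is something like the sketch below (the standalone helper and the exact wording are illustrative only, not the actual diff):

import numpy as np


def _check_n_components(n_components):
    # Illustrative only: a non-integer value such as 1.2 that is bigger
    # than 1 used to fall through and trigger a confusing error later;
    # here it is rejected up front with an explicit ValueError.
    if n_components > 1 and not np.issubdtype(type(n_components), np.integer):
        raise ValueError("n_components=%r must be of type int "
                         "when bigger than 1, was of type=%r"
                         % (n_components, type(n_components)))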

@CoderPat CoderPat changed the title Add check for n_components in pca [MRG] Add check for n_components in pca Oct 29, 2017
@jnothman
Member

LGTM. Happy to merge when Appveyor passes.

@jnothman jnothman changed the title [MRG] Add check for n_components in pca [MRG+1] Add check for n_components in pca Oct 30, 2017
Member

@jnothman jnothman left a comment

And perhaps @qinhanmin2014 has a point about putting the code in _fit_full and _fit_truncated. I'm ambivalent. WDYT, @CoderPat?

@@ -389,6 +390,12 @@ def test_pca_validation():
PCA(n_components, svd_solver=solver)
.fit, data)

n_components = 1.2
Member

Actually, can we please test for different solvers?

Contributor Author

@CoderPat CoderPat Oct 30, 2017

Sure.
It might make sense then to add the checks on individual solvers (see the discussion/comments in the issue).

@jnothman jnothman changed the title [MRG+1] Add check for n_components in pca [MRG] Add check for n_components in pca Oct 30, 2017
Member

@qinhanmin2014 qinhanmin2014 left a comment

Personal suggestions for your reference :)
I still doubt whether this is the right place for the check, but you should follow the decision from the core devs.

(n_components > 1 and
not (np.issubdtype(type(n_components), np.integer))):
raise ValueError("n_components=%r must be of type int "
"when bigger than 1, was of type=%r"
Member

I think it might be worth mentioning greater than or equal to 1 in this case.

Contributor Author

I thought it would fall into the float case, but that only applies when it's less than 1; just checked and you are right. My bad. Will fix.

@@ -389,6 +390,12 @@ def test_pca_validation():
PCA(n_components, svd_solver=solver)
.fit, data)

Member

When constructing tests in PCA, it is common to use for solver in solver_list: to loop through all the possible svd_solvers. There's already a loop above your code, so it might be better to move your test into that loop to make sure that the code works for all svd_solvers.
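
For illustration, the suggested shape would be roughly as follows (solver_list, data, and the expected message are placeholders assuming the new check is in place; this is not the actual test code):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.utils.testing import assert_raise_message

solver_list = ['full', 'arpack', 'randomized', 'auto']
data = np.random.RandomState(0).rand(10, 5)

for solver in solver_list:
    # A non-integer n_components >= 1 should raise a clear ValueError for
    # every svd_solver, not only the default one.
    assert_raise_message(ValueError, "must be of type int",
                         PCA(n_components=1.2, svd_solver=solver).fit, data)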

Contributor Author

Yeah, I will do it since @jnothman also suggested it.

@@ -389,6 +390,12 @@ def test_pca_validation():
PCA(n_components, svd_solver=solver)
.fit, data)

n_components = 1.2
Member

I might set n_components to something like 1.0 here because it seems a stronger test. But if you can make sure that your code works in such a case, I think it's also acceptable.

Contributor Author

Yeah, I didn't do it before because the check only applied for > 1. But I will do it now since the bound is inclusive (>= 1).

@CoderPat
Contributor Author

CoderPat commented Oct 30, 2017

@jnothman I think I will move the check into the individual solvers as @qinhanmin2014 said; that way I don't have to add the awkward n_components != 'mle' condition, plus it seems more consistent.
Should have it in the morning.

@CoderPat
Contributor Author

This should fix them all

Member

@qinhanmin2014 qinhanmin2014 left a comment

Personal suggestions for your reference :)

elif n_components >= 1:
if not np.issubdtype(type(n_components), np.integer):
raise ValueError("n_components=%r must be of type int "
"when greater or equal to 1, was of type=%r"
Member

greater than or equal to?

Contributor Author

Yeah, will fix, that sounds better.

"n_components={} must be of type int "
"when greater or equal to 1, was of type={}"
.format(n_components, type_ncom),
PCA(n_components).fit, data)
Member

@qinhanmin2014 qinhanmin2014 Oct 30, 2017

It seems that you are not taking different svd_solvers into account. Remember that you are inside the loop for solver in solver_list:, so it might be better to have something like PCA(n_components=..., svd_solver=solver).

Contributor Author

Yeah, that was my intention, but the morning coffee hadn't kicked in. Will fix.

@CoderPat
Contributor Author

Ok fixed (I hope)

Member

@qinhanmin2014 qinhanmin2014 left a comment

It might be better if you can search through the codebase to see what others are doing when checking the type. np.issubdtype seems a bit strange, at least to me :)

@CoderPat
Contributor Author

np.issubdtype is NumPy's recommended way to verify whether a value is any of the ints supported by NumPy (either base Python or NumPy ints). I just assumed you did it like this; I will try to search for other int type checks in the project later (I would appreciate it if a contributor who has done int checks in the past could point me to them, to save the searching effort xD)

@@ -417,6 +417,12 @@ def _fit_full(self, X, n_components):
"min(n_samples, n_features)=%r with "
"svd_solver='full'"
% (n_components, min(n_samples, n_features)))
elif n_components >= 1:
if not np.issubdtype(type(n_components), np.integer):
Contributor

Completely nitpicking (and I might even be wrong), but I think that it reads better like this:

if not isinstance(n_components, (int, np.integer)):

Contributor Author

Yeah, I just double-checked and that is equivalent, and it seems more elegant. I'll change it.
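
A quick sanity check of the equivalence for the common cases (a throwaway snippet, not part of the PR):

import numpy as np

for value in (3, np.int64(3), 1.2):
    by_subdtype = np.issubdtype(type(value), np.integer)
    by_isinstance = isinstance(value, (int, np.integer))
    # Both spellings agree here: True for 3 and np.int64(3), False for 1.2.
    assert by_subdtype == by_isinstance
# A Python 2 long is not caught by isinstance(value, (int, np.integer)),
# which is what the AppVeyor failure below runs into.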

@massich
Contributor

massich commented Oct 30, 2017

I just assumed you did it like this; I will try to search for other int type checks in the project later (I would appreciate it if a contributor who has done int checks in the past could point me to them, to save the searching effort xD)

@CoderPat I did the search not long ago for the #10017 PR, and the conclusion is that there's no consistent convention. It is true that in some places np.issubdtype is called, but always on a numpy object, in the following manner: np.issubdtype(something.dtype, .... But I didn't find np.issubdtype(type(... anywhere.

I would write if not isinstance(n_components, integer_types):

@amueller
Member

The AppVeyor failure is related:

ValueError: n_components=3L must be of type int when greater than or equal to 1, was of type=<type 'long'>

@massich
Contributor

massich commented Oct 30, 2017

In that case, isinstance(xx, (long, int, np.integer)) won't work all the time (long does not exist in Python 3), so we should use something like six's integer_types (the integer analogue of string_types).

from ..externals.six import integer_types
...
if isinstance(xx, integer_types + (np.integer,)):
    ...

@amueller do you think that we can define integer_types somewhere so that it already includes the numpy types? Actually I faced the same question in #10017. I have no clear idea where to define integer_types to make it naturally reusable.

@amueller
Member

numbers.Integral doesn't work?

@CoderPat
Contributor Author

For the time being, I'm using numbers.Integral for compatibility, but if you do decide on having an integer_types I'll be glad to change it.
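
A small check that numbers.Integral covers the cases discussed above (throwaway snippet, not part of the PR):

import numbers
import numpy as np

# numbers.Integral accepts Python ints and NumPy integer scalars; on
# Python 2 it also accepts long, which is what tripped up AppVeyor with
# n_components=3L.
print(isinstance(3, numbers.Integral))            # True
print(isinstance(np.int64(3), numbers.Integral))  # True
print(isinstance(1.2, numbers.Integral))          # False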

@jnothman
Member

jnothman commented Oct 30, 2017 via email

@CoderPat
Contributor Author

@jnothman isn't that just a workaround for not using Python's built-in type function?
It seems it would accomplish the same as the np.issubdtype(type(n_components), np.integer) that I was using originally, since it works for built-in int types. But if you feel it is more idiomatic for scikit-learn I can do it.

@jnothman
Member

jnothman commented Oct 31, 2017 via email

@CoderPat
Contributor Author

CoderPat commented Oct 31, 2017

So are we supposed to treat 0-d arrays as scalars? I remember reading a while ago that for NumPy they are not the same, so I assumed scikit-learn did the same. But yeah, if we want that I will need to wrap the value in a 0-d array and get its dtype.
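
For reference, the wrap-and-inspect approach mentioned above would look roughly like this (throwaway snippet, not part of the PR):

import numpy as np

value = np.int64(3)
# Wrapping the scalar in a 0-d array exposes a dtype we can inspect.
dtype = np.asarray(value).dtype
print(np.issubdtype(dtype, np.integer))  # True; 1.2 would give False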

@jnothman
Member

jnothman commented Oct 31, 2017 via email

@CoderPat
Contributor Author

So what do you prefer? Does the rest of scikit-learn accept 0-d arrays?

@jnothman
Member

jnothman commented Oct 31, 2017 via email

@CoderPat
Contributor Author

CoderPat commented Nov 1, 2017

Just to confirm: is there nothing else to do in this PR except wait for the outcome of #10017?

@jnothman
Member

jnothman commented Nov 2, 2017

Yes. Let's merge and fix it up there.

@jnothman
Member

jnothman commented Nov 2, 2017

Thanks @CoderPat

@jnothman jnothman merged commit c3980bc into scikit-learn:master Nov 2, 2017
qinhanmin2014 added a commit that referenced this pull request Nov 2, 2017
@qinhanmin2014
Member

@jnothman It seems the PR has introduced a PEP8 error (Travis fails on master):
test_pca.py:404:1: E303 too many blank lines (3)
I fixed it in 8f958e9
Why was it not detected in the PR?

@jnothman
Member

jnothman commented Nov 2, 2017 via email

@CoderPat
Contributor Author

CoderPat commented Nov 2, 2017

The flake8_diff script didn't detect it either...

Member

@jnothman jnothman left a comment

I mean, I understand why it's not picked up here if only the diff is run through flake8, but I thought the same was true for the build on master.

@qinhanmin2014
Member

@jnothman I think I figured out the problem.
There are actually two flake8 errors in the PR:
(1) test_pca.py:404:1: E303 too many blank lines (3)
(2) test_pca.py:13:1: F811 redefinition of unused 'assert_raise_message' from line 10
Travis fails on master because of (2), not (1) (sorry, I did not look at the master log).
So here is how a commit can pass the flake8 check in the PR but fail the flake8 check on master:
(1) clone scikit-learn at time point 1 (master-1)
(2) scikit-learn adds an import in a file (master-1 + commit-A = master-2)
(3) a PR is submitted based on master-1, adding the same import in the same file but in a different place (this way, GitHub will not detect a conflict). The flake8 check is based on master-1, so commit-A is not considered and flake8 succeeds.
(4) the PR is merged into master (master-2 + commit-B = master-3); the flake8 check is now based on master-2, so flake8 fails.

maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
Development

Successfully merging this pull request may close these issues.

Need better error when n_components is float (was: TypeError when fitting GridSearchCV)
5 participants