[MRG] Add check for n_components in pca #10042
Conversation
LGTM. Happy to merge when Appveyor passes. |
And perhaps @qinhanmin2014 has a point about putting the code in `_fit_full` and `_fit_truncated`. I'm ambivalent. WDYT, @CoderPat?
@@ -389,6 +390,12 @@ def test_pca_validation():
                      PCA(n_components, svd_solver=solver).fit, data)

    n_components = 1.2
Actually, can we please test for different solvers?
Sure.
It might make sense then to add the checks on individual solvers (see the discussion/comments in the issue).
Personal suggestions for your reference :)
I still doubt whether this is the right place for the check, but you should follow the core devs' decision.
sklearn/decomposition/pca.py
Outdated
        (n_components > 1 and
         not (np.issubdtype(type(n_components), np.integer))):
    raise ValueError("n_components=%r must be of type int "
                     "when bigger than 1, was of type=%r"
I think it might be worth mentioning "greater than or equal to 1" in this case.
I thought it would fall under the float case, but that requires a value less than 1. I just checked and you are right. My bad. Will fix.
@@ -389,6 +390,12 @@ def test_pca_validation():
                      PCA(n_components, svd_solver=solver).fit, data)

When constructing tests for PCA, it is common to use `for solver in solver_list:` to loop through all the possible svd_solvers. There's already a loop above your code, so it might be better to move your test into the loop to make sure that the code works for all svd_solvers.
Ye, I will do it since @jnothman also suggested it
@@ -389,6 +390,12 @@ def test_pca_validation():
                      PCA(n_components, svd_solver=solver).fit, data)

    n_components = 1.2
I might set n_components to something like 1.0 here because it seems a stronger test. But if you can make sure that your code works in such a case, I think it's also acceptable.
Ye, I didn't do it because previously the check only applied when the value was > 1. But I will do it now, since the bound is inclusive.
@jnothman I think I will change for the individual solver as @qinhanmin2014 said, don't have to put the weird condition of |
This should fix them all |
Personal suggestions for your reference :)
sklearn/decomposition/pca.py
Outdated
elif n_components >= 1:
    if not np.issubdtype(type(n_components), np.integer):
        raise ValueError("n_components=%r must be of type int "
                         "when greater or equal to 1, was of type=%r"
greater than or equal to?
ye will fix, sounds better
    "n_components={} must be of type int "
    "when greater or equal to 1, was of type={}"
    .format(n_components, type_ncom),
    PCA(n_components).fit, data)
It seems that you are not taking the different svd_solvers into account. Remember that you are in the loop `for solver in solver_list:`, so it might be better to have something like `PCA(n_components=..., svd_solver=solver)`.
Ye, that was my intention, but the morning coffee hadn't kicked in. Will fix
Ok fixed (I hope) |
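The pattern suggested in this thread, running the invalid n_components check once per solver, can be sketched roughly as follows. This is only an illustration and not the test that was merged: `fit_stub` is a hypothetical stand-in for `PCA(...).fit`, the helper `check_rejects_non_integer` is made up for this example, and the solver names in `solver_list` are assumed from scikit-learn's `svd_solver` options.

```python
import numbers

# Assumed solver names; the real test module defines its own solver_list.
solver_list = ["full", "arpack", "randomized", "auto"]


def fit_stub(n_components, svd_solver):
    """Hypothetical stand-in for PCA(n_components, svd_solver=...).fit(data)."""
    if n_components >= 1 and not isinstance(n_components, numbers.Integral):
        raise ValueError(
            "n_components=%r must be of type int when greater than or "
            "equal to 1, was of type=%r" % (n_components, type(n_components))
        )
    return svd_solver


def check_rejects_non_integer(solver, n_components):
    """Return True if fitting with this solver raises ValueError."""
    try:
        fit_stub(n_components, svd_solver=solver)
    except ValueError:
        return True
    return False


# Exercise every solver, including the boundary value 1.0.
results = {
    (solver, n): check_rejects_non_integer(solver, n)
    for solver in solver_list
    for n in (1.0, 1.2)
}
```

Keeping the check inside the solver loop means a solver that skips the validation shows up as a failing entry rather than going unnoticed.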
It might be better to search through the codebase to see what others do when checking the type. np.issubdtype seems a bit strange, at least to me :)
sklearn/decomposition/pca.py
Outdated
@@ -417,6 +417,12 @@ def _fit_full(self, X, n_components):
                             "min(n_samples, n_features)=%r with "
                             "svd_solver='full'"
                             % (n_components, min(n_samples, n_features)))
        elif n_components >= 1:
            if not np.issubdtype(type(n_components), np.integer):
Completely nitpicking (and I might even be wrong), but I think that it reads better like this:
if not isinstance(n_components, (int, np.integer)):
Ye, just double checked and that is the same, and seems more elegant. I'll change it.
@CoderPat I did the search not long ago because of PR #10017, and the conclusion is that there is no consistent approach. It is true that in some places I would write |
appveyor failure related:
|
in that case,
@amueller do you think that we can define |
numbers.Integral doesn't work? |
For the time being, I'm using numbers.Integral for compatibility, but if you do decide to have an integer_types, I'll be glad to change it. |
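To illustrate what a numbers.Integral check accepts and rejects, here is a small standard-library sketch. The helper name `is_integral` is made up for this example and is not part of scikit-learn.

```python
import numbers
from fractions import Fraction


def is_integral(value):
    """Return True if value belongs to a type registered as an integer."""
    return isinstance(value, numbers.Integral)


checks = {
    "int 3": is_integral(3),
    "float 1.0": is_integral(1.0),
    "float 1.2": is_integral(1.2),
    "bool True": is_integral(True),
    "Fraction(2, 1)": is_integral(Fraction(2, 1)),
}
for label, accepted in checks.items():
    print(label, accepted)
```

Note that a float such as 1.0 is rejected even though its value is whole, which is the behaviour the review asked for, while bool passes because it subclasses int.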
issubdtype seems appropriate but you might first need to wrap the scalar in
an array and get its dtype in the case where the input may be a python or
numpy type...?
|
@jnothman isn't that just a work around for not using python built-in type function? |
n_components = np.array(6)
assert np.issubdtype(type(n_components), np.integer)
AssertionError
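The difference between the two approaches under discussion can be sketched like this. Both helper names are hypothetical: `is_integer_scalar` is the strict check based on numbers.Integral, and `has_integer_dtype` follows the suggestion of wrapping the value in an array and inspecting its dtype, which also accepts 0d arrays.

```python
import numbers

import numpy as np


def is_integer_scalar(value):
    """Strict check: Python ints and NumPy integer scalars only."""
    return isinstance(value, numbers.Integral)


def has_integer_dtype(value):
    """Lenient check: wrap the value in an array and inspect its dtype."""
    return np.issubdtype(np.asarray(value).dtype, np.integer)


# A 0d array fails the strict check but passes the dtype-based one.
for value in (6, np.int64(6), np.array(6), 1.0):
    print(repr(value), is_integer_scalar(value), has_integer_dtype(value))
```

Which of the two is preferable depends on whether 0d arrays should be accepted as parameters, which is the open question in this thread.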
On 31 October 2017 at 11:24, Patrick Fernandes wrote:
@jnothman isn't that just a workaround for not using Python's built-in type function? It seems it would accomplish the same as the np.issubdtype(type(n_components), np.integer) check that I was using originally, since it works for built-in int types. But if you feel it is more idiomatic for scikit-learn, I can do it.
|
So we are supposed to treat 0d arrays as scalars? I remember reading a while ago that for NumPy they are not the same, so I assumed that scikit-learn did the same. But yes, if we want that, I will need to wrap the value in a 0d array and get its dtype. |
maybe you're right. we've had occasional issues of receiving 0d arrays for
parameters, but I can't recall in what context
|
So what do you prefer? Does the rest of scikit-learn accept 0d arrays? |
I think not consistently. Check and comment on the discussion at #7394?
|
Just to confirm, there is nothing else to do in this PR except waiting for the outcome of #10017? |
Yes. Let's merge and fix it up there. |
Thanks @CoderPat |
I wondered the same...
|
The flake8_diff script didn't detect it either... |
I mean, I understand why it's not picked up here if only the diff is run through flake8, but I thought the same was true for the build on master.
@jnothman I think I figured out the problem. |
Reference Issues/PRs
Fixes #10034
What does this implement/fix? Explain your changes.
Previously, passing an `n_components` greater than 1 that wasn't an integer would produce confusing errors. I fixed this by adding an explicit check and raising a ValueError for it (plus tests, of course).
Any other comments?
I opted to do the check in the `_fit` method, but it could have been done in `_fit_full` or `_fit_truncated`, where other checks are done. I can change it if reviewers prefer.