
[MRG] Add check for n_components in pca #10042


Merged
merged 6 commits into scikit-learn:master on Nov 2, 2017

Conversation

CoderPat
Contributor

Reference Issues/PRs

Fixes #10034

What does this implement/fix? Explain your changes.

Previously, passing an n_components bigger than 1 that wasn't an integer would produce confusing errors. I fixed this by adding an explicit check and raising a ValueError for it (+ tests, of course).

Any other comments?

I opted for doing it in the _fit method, but it could also have been done in _fit_full or _fit_truncated, where other checks are done. I can change it if reviewers prefer.
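
Roughly, the idea is something like the sketch below (the standalone helper and the exact wording are illustrative only, not the actual diff):

import numpy as np


def _check_n_components(n_components):
    # Illustrative only: a non-integer value such as 1.2 that is bigger
    # than 1 used to fall through and trigger a confusing error later;
    # here it is rejected up front with an explicit ValueError.
    if n_components > 1 and not np.issubdtype(type(n_components), np.integer):
        raise ValueError("n_components=%r must be of type int "
                         "when bigger than 1, was of type=%r"
                         % (n_components, type(n_components)))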

@CoderPat CoderPat changed the title Add check for n_components in pca [MRG] Add check for n_components in pca Oct 29, 2017
@jnothman
Member

LGTM. Happy to merge when Appveyor passes.

@jnothman jnothman changed the title [MRG] Add check for n_components in pca [MRG+1] Add check for n_components in pca Oct 30, 2017
Member

@jnothman jnothman left a comment

And perhaps @qinhanmin2014 has a point about putting the code in _fit_full and _fit_truncated. I'm ambivalent. WDYT, @CoderPat?

@@ -389,6 +390,12 @@ def test_pca_validation():
PCA(n_components, svd_solver=solver)
.fit, data)

n_components = 1.2
Member

Actually, can we please test for different solvers?

Contributor Author

@CoderPat CoderPat Oct 30, 2017

Sure.
It might make sense then to add the checks on individual solvers (see the discussion/comments in the issue).

@jnothman jnothman changed the title [MRG+1] Add check for n_components in pca [MRG] Add check for n_components in pca Oct 30, 2017
Member

@qinhanmin2014 qinhanmin2014 left a comment

Personal suggestions for your reference :)
I still doubt whether this is the right place for the check, but you should follow the decision from the core devs.

(n_components > 1 and
not (np.issubdtype(type(n_components), np.integer))):
raise ValueError("n_components=%r must be of type int "
"when bigger than 1, was of type=%r"
Member

I think it might be worth mentioning greater than or equal to 1 in this case.

Contributor Author

I thought it would fall into the float case, but that only applies when it's less than 1; just checked and you are right. My bad. Will fix.

@@ -389,6 +390,12 @@ def test_pca_validation():
PCA(n_components, svd_solver=solver)
.fit, data)

Member

When constructing tests in PCA, it is common to use for solver in solver_list: to loop through all the possible svd_solvers. There's already a loop above your code, so it might be better to move your test into that loop to make sure that the code works for all svd_solvers.
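
For illustration, the suggested shape would be roughly as follows (solver_list, data, and the expected message are placeholders assuming the new check is in place; this is not the actual test code):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.utils.testing import assert_raise_message

solver_list = ['full', 'arpack', 'randomized', 'auto']
data = np.random.RandomState(0).rand(10, 5)

for solver in solver_list:
    # A non-integer n_components >= 1 should raise a clear ValueError for
    # every svd_solver, not only the default one.
    assert_raise_message(ValueError, "must be of type int",
                         PCA(n_components=1.2, svd_solver=solver).fit, data)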

Contributor Author

Yeah, I will do it since @jnothman also suggested it.

@@ -389,6 +390,12 @@ def test_pca_validation():
PCA(n_components, svd_solver=solver)
.fit, data)

n_components = 1.2
Member

I might set n_components to something like 1.0 here because it seems a stronger test. But if you can make sure that your code works in such a case, I think it's also acceptable.

Contributor Author

Yeah, I didn't do it before because the check only applied for > 1. But I will do it now since the bound is inclusive (>= 1).

@CoderPat
Contributor Author

CoderPat commented Oct 30, 2017

@jnothman I think I will move the check into the individual solvers as @qinhanmin2014 said; that way I don't have to add the awkward n_components != 'mle' condition, plus it seems more consistent.
Should have it in the morning.

@CoderPat
Contributor Author

This should fix them all

Member

@qinhanmin2014 qinhanmin2014 left a comment

Personal suggestions for your reference :)

elif n_components >= 1:
if not np.issubdtype(type(n_components), np.integer):
raise ValueError("n_components=%r must be of type int "
"when greater or equal to 1, was of type=%r"
Member

greater than or equal to?

Contributor Author

Yeah, will fix, that sounds better.

"n_components={} must be of type int "
"when greater or equal to 1, was of type={}"
.format(n_components, type_ncom),
PCA(n_components).fit, data)
Member

@qinhanmin2014 qinhanmin2014 Oct 30, 2017

It seems that you are not taking different svd_solvers into account. Remember that you are inside the loop for solver in solver_list:, so it might be better to have something like PCA(n_components=..., svd_solver=solver).

Contributor Author

Yeah, that was my intention, but the morning coffee hadn't kicked in. Will fix.

@CoderPat
Contributor Author

Ok fixed (I hope)

Member

@qinhanmin2014 qinhanmin2014 left a comment

It might be better if you can search through the codebase to see what others are doing when checking the type. np.issubdtype seems a bit strange, at least to me :)

@CoderPat
Contributor Author

np.issubdtype is NumPy's recommended way to verify whether a value is any of the ints supported by NumPy (either base Python or NumPy ints). I just assumed you did it like this; I will try to search for other int type checks in the project later (I would appreciate it if a contributor who has done int checks in the past could point me to them, to save the searching effort xD)

@@ -417,6 +417,12 @@ def _fit_full(self, X, n_components):
"min(n_samples, n_features)=%r with "
"svd_solver='full'"
% (n_components, min(n_samples, n_features)))
elif n_components >= 1:
if not np.issubdtype(type(n_components), np.integer):
Contributor

Completely nitpicking (and I might even be wrong), but I think that it reads better like this:

if not isinstance(n_components, (int, np.integer)):

Contributor Author

Yeah, I just double-checked and that is equivalent, and it seems more elegant. I'll change it.
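
A quick sanity check of the equivalence for the common cases (a throwaway snippet, not part of the PR):

import numpy as np

for value in (3, np.int64(3), 1.2):
    by_subdtype = np.issubdtype(type(value), np.integer)
    by_isinstance = isinstance(value, (int, np.integer))
    # Both spellings agree here: True for 3 and np.int64(3), False for 1.2.
    assert by_subdtype == by_isinstance
# A Python 2 long is not caught by isinstance(value, (int, np.integer)),
# which is what the AppVeyor failure below runs into.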

@massich
Contributor

massich commented Oct 30, 2017

I just assumed you did it like this; I will try to search for other int type checks in the project later (I would appreciate it if a contributor who has done int checks in the past could point me to them, to save the searching effort xD)

@CoderPat I did the search not long ago for the #10017 PR, and the conclusion is that there's no consistent convention. It is true that in some places np.issubdtype is called, but always on a numpy object, in the following manner: np.issubdtype(something.dtype, .... But I didn't find np.issubdtype(type(... anywhere.

I would write if not isinstance(n_components, integer_types):

@amueller
Member

The AppVeyor failure is related:

ValueError: n_components=3L must be of type int when greater than or equal to 1, was of type=<type 'long'>

@massich
Contributor

massich commented Oct 30, 2017

In that case, isinstance(xx, (long, int, np.integer)) won't work all the time (long does not exist in Python 3), so we should use something like six's integer_types (the integer analogue of string_types).

from ..externals.six import integer_types
...
if isinstance(xx, integer_types + (np.integer,)):
    ...

@amueller do you think that we can define integer_types somewhere so that it already includes the numpy types? Actually I faced the same question in #10017. I have no clear idea where to define integer_types to make it naturally reusable.

@amueller
Member

numbers.Integral doesn't work?

@CoderPat
Contributor Author

For the time being, I'm using numbers.Integral for compatibility, but if you do decide on having an integer_types I'll be glad to change it.
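
A small check that numbers.Integral covers the cases discussed above (throwaway snippet, not part of the PR):

import numbers
import numpy as np

# numbers.Integral accepts Python ints and NumPy integer scalars; on
# Python 2 it also accepts long, which is what tripped up AppVeyor with
# n_components=3L.
print(isinstance(3, numbers.Integral))            # True
print(isinstance(np.int64(3), numbers.Integral))  # True
print(isinstance(1.2, numbers.Integral))          # False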

@jnothman
Member

jnothman commented Oct 30, 2017 via email

@CoderPat
Contributor Author

@jnothman isn't that just a workaround for not using Python's built-in type function?
It seems it would accomplish the same as the np.issubdtype(type(n_components), np.integer) that I was using originally, since it works for built-in int types. But if you feel it is more idiomatic for scikit-learn I can do it.

@jnothman
Member

jnothman commented Oct 31, 2017 via email

@CoderPat
Contributor Author

CoderPat commented Oct 31, 2017

So are we supposed to treat 0-d arrays as scalars? I remember reading a while ago that for NumPy they are not the same, so I assumed scikit-learn did the same. But yeah, if we want that I will need to wrap the value in a 0-d array and get its dtype.
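
For reference, the wrap-and-inspect approach mentioned above would look roughly like this (throwaway snippet, not part of the PR):

import numpy as np

value = np.int64(3)
# Wrapping the scalar in a 0-d array exposes a dtype we can inspect.
dtype = np.asarray(value).dtype
print(np.issubdtype(dtype, np.integer))  # True; 1.2 would give False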

@jnothman
Member

jnothman commented Oct 31, 2017 via email

@CoderPat
Contributor Author

So what do you prefer? Does the rest of scikit-learn accept 0-d arrays?

@jnothman
Member

jnothman commented Oct 31, 2017 via email

@CoderPat
Contributor Author

CoderPat commented Nov 1, 2017

Just to confirm: is there nothing else to do in this PR except wait for the outcome of #10017?

@jnothman
Member

jnothman commented Nov 2, 2017

Yes. Let's merge and fix it up there.

@jnothman
Member

jnothman commented Nov 2, 2017

Thanks @CoderPat

@jnothman jnothman merged commit c3980bc into scikit-learn:master Nov 2, 2017
qinhanmin2014 added a commit that referenced this pull request Nov 2, 2017
@qinhanmin2014
Member

@jnothman It seems the PR has introduced a PEP8 error (Travis fails on master):
test_pca.py:404:1: E303 too many blank lines (3)
I fixed it in 8f958e9
Why was it not detected in the PR?

@jnothman
Member

jnothman commented Nov 2, 2017 via email

@CoderPat
Contributor Author

CoderPat commented Nov 2, 2017

The flake8_diff script didn't detect it either...

Member

@jnothman jnothman left a comment

I mean, I understand why it's not picked up here if only the diff is run through flake8, but I thought the same was true for the build on master.

@qinhanmin2014
Member

@jnothman I think I figured out the problem.
There are actually two flake8 errors in the PR:
(1) test_pca.py:404:1: E303 too many blank lines (3)
(2) test_pca.py:13:1: F811 redefinition of unused 'assert_raise_message' from line 10
Travis fails on master because of (2), not (1) (sorry, I did not look at the master log).
So here is how a commit can pass the flake8 check in the PR but fail the flake8 check on master:
(1) clone scikit-learn at time point 1 (master-1)
(2) scikit-learn adds an import in a file (master-1 + commit-A = master-2)
(3) a PR is submitted based on master-1, adding the same import in the same file but in a different place (this way, GitHub will not detect a conflict). The flake8 check is based on master-1, so commit-A is not considered and flake8 succeeds.
(4) the PR is merged into master (master-2 + commit-B = master-3); the flake8 check is now based on master-2, so flake8 fails.

maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
Development

Successfully merging this pull request may close these issues.

Need better error when n_components is float (was: TypeError when fitting GridSearchCV)
5 participants