[MRG + 1] Raising an error when batch_size < n_components in IncrementalPCA #9303

Merged: 22 commits, Aug 14, 2017
22 commits
d382bb4
fixed bug (not tested), writing test
Jul 8, 2017
fcb2768
removed lower interval comparison check from fix, more work on test
Jul 8, 2017
d4bd366
fix was failing another test, + finished test for fix
Jul 8, 2017
71c5a73
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
Jul 10, 2017
2cff58d
Revert "Merge branch 'master' of https://github.com/scikit-learn/scik…
Jul 14, 2017
624e3dd
Merge branch 'master' into n_samples6452
wallygauze Jul 14, 2017
5b250ce
Correcting side-effects from reverting merge
wallygauze Jul 14, 2017
c508034
Correction number 2
wallygauze Jul 14, 2017
e6b38e3
Correction number 3
wallygauze Jul 14, 2017
93f7301
Correction number 4
wallygauze Jul 14, 2017
1acfd8b
Correction number 5
wallygauze Jul 14, 2017
289a8ac
Last Correction
wallygauze Jul 14, 2017
be5ac2d
added regression tests for n_comp=None case in incremental pca
Jul 17, 2017
090c0f4
Merge branch 'n_samples6452' of https://github.com/wallygauze/scikit-…
Jul 17, 2017
eee25b3
some lines were never used, turned to code better for coverage
Jul 17, 2017
46fd392
Update whats_new.rst
wallygauze Jul 24, 2017
a755554
modifying error message (part 1)
wallygauze Jul 25, 2017
522ebe0
modifying error message part2
wallygauze Jul 25, 2017
5bdc0f3
Minor improvements in test_pca.py
lesteve Aug 3, 2017
d15c601
moved entry to 0.20
Aug 14, 2017
5d989a9
Merge branch 'master' into n_samples6452
wallygauze Aug 14, 2017
41d4613
Merge branch 'master' into n_samples6452
jnothman Aug 14, 2017
6 changes: 5 additions & 1 deletion doc/whats_new.rst
@@ -41,12 +41,16 @@ Bug fixes

Decomposition, manifold learning and clustering

+- Fix for uninformative error in :class:`decomposition.incremental_pca`:
+  now an error is raised if the number of components is larger than the
+  chosen batch size. The ``n_components=None`` case was adapted accordingly.
+  :issue:`6452`. By :user:`Wally Gauze <wallygauze>`.
+
 - Fixed a bug where the ``partial_fit`` method of
   :class:`decomposition.IncrementalPCA` used integer division instead of float
   division on Python 2 versions. :issue:`9492` by
   :user:`James Bourbeau <jrbourbeau>`.


Version 0.19
============

9 changes: 8 additions & 1 deletion sklearn/decomposition/incremental_pca.py
@@ -211,11 +211,18 @@ def partial_fit(self, X, y=None, check_input=True):
self.components_ = None

         if self.n_components is None:
-            self.n_components_ = n_features
+            if self.components_ is None:
Member

Can you add regression tests for these? I guess if n_features < n_samples we had an error earlier?

Contributor Author

Yes, master raised an error if n_samples < n_features (you wrote the opposite, but I believe that was a typo, right?). As a 'visual' aid: this is the partial_fit method, so n_samples is the size of the batch being processed.
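
The `n_components=None` handling discussed here can be sketched in isolation. This is a standalone illustration that mirrors the branch added in the diff, not scikit-learn's code; `infer_n_components` and its `n_fitted` argument (a stand-in for `components_.shape[0]`) are hypothetical names introduced only for this example.

```python
def infer_n_components(n_samples, n_features, n_fitted=None):
    """Resolve n_components_ when the user passed n_components=None.

    n_fitted is a hypothetical stand-in for components_.shape[0] from an
    earlier partial_fit call (None means nothing has been fitted yet).
    """
    if n_fitted is None:
        # First batch: a batch of shape (n_samples, n_features) supports at
        # most min(n_samples, n_features) components.
        return min(n_samples, n_features)
    # Later batches: keep the count fixed by the first batch.
    return n_fitted


# A wide first batch (10 samples, 50 features) caps the count at 10 ...
first = infer_n_components(n_samples=10, n_features=50)
# ... and a later, larger batch does not change it.
second = infer_n_components(n_samples=200, n_features=50, n_fitted=first)
print(first, second)
```

With the old behaviour (`n_components_ = n_features`), the first call in this sketch would have asked for 50 components from a 10-sample batch, which is the case reported as failing on master.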

+                self.n_components_ = min(n_samples, n_features)
+            else:
+                self.n_components_ = self.components_.shape[0]
         elif not 1 <= self.n_components <= n_features:
             raise ValueError("n_components=%r invalid for n_features=%d, need "
                              "more rows than columns for IncrementalPCA "
Member

I have no idea what "more rows than columns" means here ...

Contributor Author

I don't think the message is good here either, but I wanted to focus my pull request on the points mentioned in #6452 (so that it would be reviewed and merged more quickly).

                              "processing" % (self.n_components, n_features))
+        elif not self.n_components <= n_samples:
+            raise ValueError("n_components=%r must be less or equal to "
+                             "the batch number of samples "
Member

Funnily enough we were chatting with @ogrisel about this yesterday in an unrelated context. IIUC he was hoping that IncrementalPCA would be able to do partial_fit on a small number of samples (and converge to something sensible after a few calls to partial_fit). It looks like this is not the case at the moment ...

Member

Also I bumped into the same problem (checking that n_components <= n_features but not n_components <= n_samples) in sklearn/decomposition/pca.py yesterday. There are also slight inconsistencies between _fit_full and _fit_truncated. Not in this PR but I think we should have a helper function that is reused where appropriate.
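
The shared helper suggested above might look roughly like the following. This is only a sketch under stated assumptions: `check_n_components`, its signature, and its error message are hypothetical and are not part of scikit-learn or of this PR.

```python
def check_n_components(n_components, n_samples, n_features):
    """Hypothetical shared validator; not part of scikit-learn.

    Enforces both bounds in one place: at least 1, and no larger than
    either dimension of the data (or batch) being decomposed.
    """
    upper = min(n_samples, n_features)
    if not 1 <= n_components <= upper:
        raise ValueError(
            "n_components=%r must be between 1 and "
            "min(n_samples, n_features)=%d" % (n_components, upper))
    return n_components


# Same shape as the test data in this PR: 2 samples, 3 features.
for candidate in (-1, 0, .99, 2, 3, 4):
    try:
        check_n_components(candidate, n_samples=2, n_features=3)
        print(candidate, "ok")
    except ValueError as exc:
        print(candidate, "rejected:", exc)
```

Folding both bounds into `min(n_samples, n_features)` is one possible design; keeping two separate checks, as the diff does, allows a more specific message for each failure.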

Contributor Author

@wallygauze Aug 3, 2017

@lesteve I have a pull request for PCA as well: #8742.
It has received a number of reviews and seems pretty much finished, but it has not had much attention in recent months because it was not marked for the 0.19 release (which, on second thought, may be incongruous since it is practically the same as this one).

Do you want to have a look? I think it may be a quick one to finish off.

+                             "%d." % (self.n_components, n_samples))
         else:
             self.n_components_ = self.n_components

40 changes: 36 additions & 4 deletions sklearn/decomposition/tests/test_incremental_pca.py
@@ -4,6 +4,7 @@
 from sklearn.utils.testing import assert_almost_equal
 from sklearn.utils.testing import assert_array_almost_equal
 from sklearn.utils.testing import assert_raises
+from sklearn.utils.testing import assert_raises_regex
 
 from sklearn import datasets
 from sklearn.decomposition import PCA, IncrementalPCA
@@ -73,10 +74,41 @@ def test_incremental_pca_inverse():

 def test_incremental_pca_validation():
     # Test that n_components is >=1 and <= n_features.
-    X = [[0, 1], [1, 0]]
-    for n_components in [-1, 0, .99, 3]:
-        assert_raises(ValueError, IncrementalPCA(n_components,
-                                                 batch_size=10).fit, X)
+    X = np.array([[0, 1, 0], [1, 0, 0]])
+    n_samples, n_features = X.shape
+    for n_components in [-1, 0, .99, 4]:
+        assert_raises_regex(ValueError,
+                            "n_components={} invalid for n_features={}, need"
+                            " more rows than columns for IncrementalPCA "
+                            "processing".format(n_components, n_features),
+                            IncrementalPCA(n_components, batch_size=10).fit, X)
Member

Should this also be raised for partial_fit?
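
One way to see why a single check can cover both entry points: if `fit` only slices the data into batches and hands each one to `partial_fit` (which is how `IncrementalPCA.fit` is structured), then validation placed in `partial_fit` fires on either path. The toy class below is a sketch of that delegation only; `ToyIncremental` is a hypothetical name, it does no PCA, and its messages are shortened versions of the ones in the diff.

```python
class ToyIncremental:
    """Toy stand-in for an incremental estimator; not scikit-learn code."""

    def __init__(self, n_components, batch_size):
        self.n_components = n_components
        self.batch_size = batch_size

    def partial_fit(self, batch):
        n_samples, n_features = len(batch), len(batch[0])
        if not 1 <= self.n_components <= n_features:
            raise ValueError("n_components=%r invalid for n_features=%d"
                             % (self.n_components, n_features))
        if self.n_components > n_samples:
            raise ValueError("n_components=%r must be less or equal to the "
                             "batch number of samples %d."
                             % (self.n_components, n_samples))
        return self

    def fit(self, X):
        # fit only slices X into batches; all validation lives in partial_fit.
        for start in range(0, len(X), self.batch_size):
            self.partial_fit(X[start:start + self.batch_size])
        return self


X = [[0, 1, 0], [1, 0, 0]]
for method in ("fit", "partial_fit"):
    try:
        getattr(ToyIncremental(n_components=4, batch_size=10), method)(X)
    except ValueError as exc:
        print(method, "->", exc)
```

Under that assumption, both calls report the same `n_features` error, so a test that exercises `fit` also exercises the check in `partial_fit`.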


+    # Tests that n_components is also <= n_samples.
+    n_components = 3
+    assert_raises_regex(ValueError,
+                        "n_components={} must be less or equal to "
+                        "the batch number of samples {}".format(
+                            n_components, n_samples),
+                        IncrementalPCA(
+                            n_components=n_components).partial_fit, X)
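
The test builds its expected message with `str.format` while the library builds the real one with `%` formatting, and the expected string omits the trailing period. That works if the assertion does a regex search rather than an exact comparison, which is how `unittest`'s `assertRaisesRegex` behaves; that `assert_raises_regex` matches the same way is assumed here. A stdlib-only sketch of the matching step:

```python
import re

n_components, n_samples = 3, 2

# Message as built in partial_fit (old-style % formatting).
message = ("n_components=%r must be less or equal to "
           "the batch number of samples "
           "%d." % (n_components, n_samples))

# Pattern as built in the test (str.format), without the trailing period.
pattern = ("n_components={} must be less or equal to "
           "the batch number of samples {}".format(n_components, n_samples))

# A search-style match only needs the pattern to occur inside the message.
matched = re.search(pattern, message) is not None
print(matched)

# Values such as 0.99 contain regex metacharacters ("." matches any
# character), so escaping the pattern is the stricter option.
strict = re.search(re.escape("n_components=0.99"), "n_components=0.99 invalid")
print(strict is not None)
```

Unescaped patterns still match in the cases tested by this PR; `re.escape` only matters when a looser match would hide a wrong message.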


+def test_n_components_none():
+    # Ensures that n_components == None is handled correctly
+    rng = np.random.RandomState(1999)
+    for n_samples, n_features in [(50, 10), (10, 50)]:
+        X = rng.rand(n_samples, n_features)
+        ipca = IncrementalPCA(n_components=None)
+
+        # First partial_fit call, ipca.n_components_ is inferred from
+        # min(X.shape)
+        ipca.partial_fit(X)
+        assert ipca.n_components_ == min(X.shape)
+
+        # Second partial_fit call, ipca.n_components_ is inferred from
+        # ipca.components_ computed from the first partial_fit call
+        ipca.partial_fit(X)
+        assert ipca.n_components_ == ipca.components_.shape[0]


def test_incremental_pca_set_params():