
[MRG+1] Ensures that partial_fit for sklearn.decomposition.IncrementalPCA uses float division #9492


Conversation

jrbourbeau
Contributor

Reference Issue

Fixes #9489

What does this implement/fix? Explain your changes.

Currently, the partial_fit method of sklearn.decomposition.IncrementalPCA uses integer division under Python 2 and float division under Python 3 (see https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/incremental_pca.py#L249). This PR ensures that float division is used on all Python versions by casting the relevant numerator to a floating-point number.
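For illustration, the difference boils down to how `/` behaves on two integers; this sketch is not the sklearn source, and the variable names are made up — `//` is used to reproduce Python 2's truncating behavior on any interpreter:

```python
# Hypothetical stand-ins for the quantities partial_fit divides:
numerator = 5      # e.g. a running sample count
denominator = 2    # e.g. a batch size

# Python 2's "/" on two ints truncates, equivalent to "//" everywhere:
py2_result = numerator // denominator          # integer division: 2
# The PR's fix: cast the numerator so "/" keeps the fractional part:
fixed_result = float(numerator) / denominator  # float division: 2.5

print(py2_result, fixed_result)
```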

Any other comments?

While casting the numerator to a float fixes issue #9489, I'm not sure if this is the preferred style, or if adding from __future__ import division would be a better option.

@NelleV
Member

NelleV commented Aug 4, 2017

Hi @jrbourbeau
Thanks a lot for the patch. It looks good overall, but can you add a test?

@NelleV
Member

NelleV commented Aug 4, 2017

(also, it might be more elegant to use the from __future__ import division solution.)
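As a sketch of that suggestion: the future import changes the meaning of `/` for the whole module, so no per-site casts are needed, and floor division stays available via `//`:

```python
# With this at the top of a module, "/" performs true (float) division
# even on Python 2, matching Python 3 semantics.
from __future__ import division

print(7 / 2)   # 3.5 on both Python 2 and Python 3
print(7 // 2)  # 3; floor division remains available explicitly
```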

@jnothman
Member

jnothman commented Aug 5, 2017 via email

@jrbourbeau
Contributor Author

I would prefer the from __future__ import division route as well. @jnothman, do you think it's safe to assume that the internals of sklearn.decomposition.IncrementalPCA aren't something users will be doing a copy-paste with?

@jrbourbeau
Contributor Author

@NelleV, do you have any suggestions for a test to add? I was thinking the original code that caught this problem in issue #9489 might make a good test. Something like

import numpy as np
from sklearn.decomposition import IncrementalPCA

def test_partial_fit_correct_answer():
    A = np.array([[1, 2, 4], [5, 3, 6]])
    B = np.array([[6, 7, 3], [5, 2, 1], [3, 5, 6]])
    C = np.array([[3, 2, 1]])

    ipca = IncrementalPCA(n_components=2)
    ipca.partial_fit(A)
    ipca.partial_fit(B)
    # Known answer is [[-1.48864923, -3.15618645]]
    np.testing.assert_allclose(ipca.transform(C), [[-1.48864923, -3.15618645]])

But I'm not sure if hard-coding the correct answer is the way to go, or if one should simply check for float division (since that's the real problem here) with something like

def test_float_division():
    assert 3/2 == 1.5

@jnothman
Member

jnothman commented Aug 9, 2017

A test hard-coding the correct answer, with a comment saying it's a non-regression test for issue #9489, would be great.

@lesteve
Member

lesteve commented Aug 9, 2017

Here is a snippet that can be easily turned into a regression test:

import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
A = rng.randn(5, 3) + 2
B = rng.randn(7, 3) + 5

pca = IncrementalPCA(n_components=2)
pca.partial_fit(A)
pca.n_samples_seen_ = float(pca.n_samples_seen_)
pca.partial_fit(B)
print(pca.singular_values_)

pca2 = IncrementalPCA(n_components=2)
pca2.partial_fit(A)
pca2.partial_fit(B)
print(pca2.singular_values_)

With scikit-learn 0.18.2 and Python 2.7, the output is:

[ 7.70269857  3.88561777]
[ 6.64122066  3.88184665]
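That snippet could be turned into a self-contained test along these lines (a sketch: the test function name is made up, and the float cast of n_samples_seen_ follows the snippet above to force float-division behavior regardless of interpreter):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

def test_incremental_pca_partial_fit_float_division():
    # Non-regression test for issue #9489: under Python 2, integer division
    # inside partial_fit skewed the results of subsequent calls.
    rng = np.random.RandomState(0)
    A = rng.randn(5, 3) + 2
    B = rng.randn(7, 3) + 5

    pca = IncrementalPCA(n_components=2)
    pca.partial_fit(A)
    # Force float arithmetic for the sample counter
    pca.n_samples_seen_ = np.float64(pca.n_samples_seen_)
    pca.partial_fit(B)
    singular_vals_float_samples_seen = pca.singular_values_

    pca2 = IncrementalPCA(n_components=2)
    pca2.partial_fit(A)
    pca2.partial_fit(B)
    singular_vals_int_samples_seen = pca2.singular_values_

    # Both code paths must agree once division is consistently float
    np.testing.assert_allclose(singular_vals_float_samples_seen,
                               singular_vals_int_samples_seen)
```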

@jrbourbeau jrbourbeau changed the title Ensures that partial_fit for sklearn.decomposition.IncrementalPCA uses float division [MRG] Ensures that partial_fit for sklearn.decomposition.IncrementalPCA uses float division Aug 10, 2017
@@ -273,3 +273,17 @@ def test_whitening():
assert_almost_equal(X, Xinv_ipca, decimal=prec)
assert_almost_equal(X, Xinv_pca, decimal=prec)
assert_almost_equal(Xinv_pca, Xinv_ipca, decimal=prec)


def test_partial_fit_correct_answer():
Member


I'd rather have something like #9492 (comment) as a regression test. To me this seems less magical than comparing to a "known answer" whose provenance is not very clear.

Contributor Author


Hey @lesteve, thanks for the comment! I'll modify the test accordingly

@lesteve
Member

lesteve commented Aug 10, 2017

Also, you'll want to add an entry to doc/whats_new.rst about your fix. Do mention it is Python 2 only.

@jrbourbeau
Contributor Author

It looks like I added a bunch of other commits I didn't intend to in my most recent push. I was trying to update doc/whats_new.rst with an entry for this PR, but had to rebase on master to get the updates to doc/whats_new.rst that had been added since the feature branch for this PR was made.

I attempted the rebase following the instructions from the contributing guide (https://docs.scipy.org/doc/numpy/dev/gitwash/development_workflow.html#rebasing-on-master), but I must have accidentally included all the other commits as well.

I'm not exactly sure what to do here. Does anyone know of a way to fix this? Or should I open up a new PR? Thanks!!

@jrbourbeau
Contributor Author

I made a backup branch before I used git rebase. Would

# reset branch back to the saved point
git reset --hard backup_branch
# push this saved point to the remote branch for this PR
git push -f origin remote_feature_branch

work to revert this PR back to its state before I did the rebase?

@jnothman
Member

jnothman commented Aug 10, 2017 via email

@jrbourbeau jrbourbeau force-pushed the IncrementalPCA-partial_fit-float-division branch from 51a693d to 3016393 on August 11, 2017 00:27
@jrbourbeau
Contributor Author

Thanks so much @jnothman!

I want to add an entry to doc/whats_new.rst related to this PR. However, there have been changes made to the current master branch version of doc/whats_new.rst. Namely, a section for version 0.20 has been added. I would like to add an entry for this PR to the version 0.20 section without creating any merge conflicts with the current master branch.

I was wondering how I should go about this. One possible way would be to copy the new version 0.20 section from the current master branch into my feature branch, add an entry for this PR, and then commit it. Is this the preferred way, or is there a different way to go about this?

@jnothman
Member

jnothman commented Aug 11, 2017 via email

…into IncrementalPCA-partial_fit-float-division

Want to get recent changes made to `doc/whats_new.rst` into my feature branch
@jrbourbeau
Contributor Author

It worked! Thanks for helping me with that @jnothman

@jrbourbeau jrbourbeau changed the title [MRG] Ensures that partial_fit for sklearn.decomposition.IncrementalPCA uses float division [WIP] Ensures that partial_fit for sklearn.decomposition.IncrementalPCA uses float division Aug 11, 2017
@jrbourbeau
Contributor Author

jrbourbeau commented Aug 11, 2017

PR #9526 updated doc/whats_new.rst on master in preparation for the 0.19 release, leading to a merge conflict. But now all conflicts have been resolved. Unless there are any other comments, I think this PR is good to go.

To summarize, this PR:

  1. Makes sure the partial_fit method for sklearn.decomposition.IncrementalPCA uses float division under Python 2.
  2. Adds a non-regression test for issue #9489 (IncrementalPCA.partial_fit doesn't use float division in python 2) to sklearn/decomposition/tests/test_incremental_pca.py.
  3. Updates doc/whats_new.rst with an entry for this PR.

@jrbourbeau jrbourbeau changed the title [WIP] Ensures that partial_fit for sklearn.decomposition.IncrementalPCA uses float division [MRG] Ensures that partial_fit for sklearn.decomposition.IncrementalPCA uses float division Aug 11, 2017
Member

@jnothman jnothman left a comment


Yes, LGTM. Thanks

occurs due to changes in the modelling logic (bug fixes or enhancements), or in
random sampling procedures.

- :class:`decomposition.IncrementalPCA` (bug fix)
Member


add "in Python 2"

@jnothman jnothman changed the title [MRG] Ensures that partial_fit for sklearn.decomposition.IncrementalPCA uses float division [MRG+1] Ensures that partial_fit for sklearn.decomposition.IncrementalPCA uses float division Aug 14, 2017
@NelleV NelleV merged commit 86d8f18 into scikit-learn:master Aug 14, 2017
@NelleV
Member

NelleV commented Aug 14, 2017

Thanks @jrbourbeau !

@jrbourbeau jrbourbeau deleted the IncrementalPCA-partial_fit-float-division branch August 14, 2017 19:56
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…lPCA uses float division (scikit-learn#9492)

* Ensures that partial_fit uses float division

* Switches to using future division for float division

* Adds non-regression test for issue scikit-learn#9489

* Updates test to remove dependence on a "known answer"

* Updates doc/whats_new.rst with entry for this PR

* Specifies bug fix is for Python 2 versions in doc/whats_new.rst
AishwaryaRK pushed a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
Successfully merging this pull request may close these issues.

IncrementalPCA.partial_fit doesn't use float division in python 2
4 participants