
[MRG] Increase mean precision for large float32 arrays #12338


Merged (7 commits) on Oct 16, 2018

Conversation

@bauks (Contributor) commented Oct 9, 2018

Reference Issues/PRs

Fixes #12333.

What does this implement/fix? Explain your changes.

Uses at least float64 precision when computing the mean in _incremental_mean_and_var.
This avoids precision issues with np.mean on large multidimensional float32 arrays, as discussed in numpy/numpy#9393.
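A minimal sketch of the issue and the fix, assuming an illustrative array shape and value range (the real change lives inside _incremental_mean_and_var; here np.mean's dtype argument stands in for it):

```python
import numpy as np

# Many float32 values accumulated in a float32 accumulator lose low-order
# bits; along axis 0 of a large 2-D array the error becomes visible.
rng = np.random.RandomState(0)
X = rng.normal(loc=100.0, scale=0.01, size=(1_000_000, 2)).astype(np.float32)

exact = X.astype(np.float64).mean(axis=0)       # reference, fully in float64
naive = X.mean(axis=0)                          # float32 accumulation
fixed = X.mean(axis=0, dtype=np.float64)        # at least float64, as in the fix

print("naive error:", np.abs(naive - exact).max())
print("fixed error:", np.abs(fixed - exact).max())
```

With a float64 accumulator the error stays near machine precision of the reference, while the float32 accumulation drifts as the running sum grows.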

@jnothman (Member)

Can you confirm that there is an existing test ensuring the input's dtype is maintained after scaling, and add one if not?

@jnothman (Member) left a comment

Otherwise LGTM

@bauks (Contributor, Author) commented Oct 10, 2018

I didn't see such a test, so I added one. Note that it fails even before this change if we include np.float128, due to the check_array call in StandardScaler.partial_fit with dtype=FLOAT_DTYPES; this seems fine, as a clear warning is given:

    sklearn/utils/validation.py:590: DataConversionWarning: Data with input dtype float128 was converted to float64 by StandardScaler.
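The check described here might look roughly like the following sketch (the function name and data are illustrative, not the exact test added in the PR):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def check_scaler_dtype_preserved():
    # StandardScaler should return output with the same floating dtype as
    # its input for the dtypes in FLOAT_DTYPES (float32 and float64);
    # np.float128, where available, gets converted with a warning.
    rng = np.random.RandomState(0)
    for dtype in (np.float32, np.float64):
        X = rng.randn(10, 3).astype(dtype)
        X_scaled = StandardScaler().fit(X).transform(X)
        assert X_scaled.dtype == dtype

check_scaler_dtype_preserved()
```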

@jnothman (Member) left a comment

Actually, does the dtype of mean_ etc change? This might at least deserve a note in what's new, if not a test. Otherwise this is looking good!

@jnothman (Member)

I'd be okay to include this in 0.20.X.

@bauks (Contributor, Author) commented Oct 10, 2018

The dtype of mean_ and scale_ is always float64, which is also the case without this change. Should I add a check for this to the test?
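This can be checked directly; a small sketch, assuming a scikit-learn version where the fitted statistics are accumulated in float64:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Even for float32 input, the fitted statistics are stored in float64,
# because the incremental mean/variance are accumulated at that precision.
X32 = np.random.RandomState(0).randn(50, 2).astype(np.float32)
scaler = StandardScaler().fit(X32)

print(scaler.mean_.dtype, scaler.scale_.dtype)  # float64 float64
```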

@jnothman (Member) commented Oct 11, 2018 via email

@rth (Member) left a comment

LGTM apart from the comment below, and https://github.com/scikit-learn/scikit-learn/pull/12338/files#r224266370 still needs addressing (by removing copy=True).

@bauks (Contributor, Author) commented Oct 12, 2018

Great, I think those issues have been addressed now.

@rth (Member) left a comment

LGTM. Please add a what's new entry in doc/whats_new/v0.20.rst under the 0.20.1 section. As far as I can tell, the estimators affected by this are preprocessing.StandardScaler and decomposition.IncrementalPCA.

@rth rth added this to the 0.20.1 milestone Oct 13, 2018
Jonathan Blackman added 6 commits October 15, 2018 10:24
@rth (Member) commented Oct 16, 2018

Thanks, merging; the AppVeyor failure is unrelated.

Review comment on the what's new entry in doc/whats_new/v0.20.rst:

    precision issues when using float32 datasets. Affects
    :class:`preprocessing.StandardScaler` and
    :class:`decomposition.IncrementalPCA`.
    :issue:`12333` by :user:`bauks <bauks>`.

A reviewer (Member) replied:

Actually this should reference the PR, and we can't mention a private function in the what's new, only the effect of this change on public estimators. I'll fix it.

@rth rth merged commit a1d0e96 into scikit-learn:master Oct 16, 2018
anuragkapale pushed a commit to anuragkapale/scikit-learn that referenced this pull request Oct 23, 2018
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Nov 14, 2018
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
Development

Successfully merging this pull request may close these issues.

StandardScaler obtains incorrect means for large np.float32 dtype datasets
3 participants