[MRG] Increase mean precision for large float32 arrays #12338
Conversation
Can you confirm that there is an existing test ensuring the input dtype is maintained after scaling, and add one if not?
Otherwise LGTM
I didn't see such a test, so I added one. Note that it fails even before this change if we include np.float128, due to the check_array call in StandardScaler.partial_fit with dtype=FLOAT_DTYPES; this seems fine, as a clear warning is given.
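For illustration, a minimal sketch of the kind of dtype-preservation check being discussed (not the actual test added in this PR; the shapes and seed are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Sketch of a dtype-preservation check: scaling a float32 or float64
# input should return an array of the same dtype.
rng = np.random.RandomState(0)
for dtype in (np.float32, np.float64):
    X = rng.rand(100, 3).astype(dtype)
    X_scaled = StandardScaler().fit_transform(X)
    assert X_scaled.dtype == dtype
```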
Actually, does the dtype of mean_ etc. change? This might at least deserve a note in what's new, if not a test. Otherwise this is looking good!
I'd be okay to include this in 0.20.X.
The dtype of mean_ and scale_ is always float64, which is also the case without this change. Should I add a check for this to the test?
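A sketch of that check (assuming the behavior described above, i.e. fitted statistics accumulated in float64 at the time of this PR; not guaranteed across versions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Even when fit on float32 data, mean_ and scale_ are float64
# (behavior described in this thread; assumed, not version-proof).
X = np.random.RandomState(0).rand(50, 4).astype(np.float32)
scaler = StandardScaler().fit(X)
assert scaler.mean_.dtype == np.float64
assert scaler.scale_.dtype == np.float64
```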
Sure, thanks!
LGTM apart from the comment below, and https://github.com/scikit-learn/scikit-learn/pull/12338/files#r224266370 still needs addressing (by removing copy=True).
Great, I think those issues have been addressed now.
LGTM, please add a what's new entry in doc/whats_new/v0.20.rst under the 0.20.1 section. As far as I can tell, the estimators affected by this are preprocessing.StandardScaler and decomposition.IncrementalPCA.
Use at least float64 precision when computing the mean in _incremental_mean_and_var. This avoids precision issues with np.mean on long multidimensional float32 arrays.
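For context, a minimal sketch of the idea behind the fix (a hypothetical helper, not scikit-learn's actual _incremental_mean_and_var):

```python
import numpy as np

def incremental_mean(batches):
    """Hypothetical helper: running mean with a float64 accumulator,
    even when the incoming batches are float32."""
    total, count = 0.0, 0
    for X in batches:
        # dtype=np.float64 forces the summation to be carried out in
        # double precision, avoiding float32 rounding drift.
        total = total + np.sum(X, axis=0, dtype=np.float64)
        count += X.shape[0]
    return total / count
```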
Thanks, merging; the AppVeyor failure is unrelated.
doc/whats_new/v0.20.rst (outdated)
precision issues when using float32 datasets. Affects
:class:`preprocessing.StandardScaler` and
:class:`decomposition.IncrementalPCA`.
:issue:`12333` by :user:`bauks <bauks>`.
Actually, this should reference the PR, and we can't mention a private function in the what's new, only the effect of this change on public estimators. I'll fix it.
…learn#12338)" This reverts commit 91f7a68.
Reference Issues/PRs
Fixes #12333.
What does this implement/fix? Explain your changes.
Uses at least float64 precision when computing the mean in _incremental_mean_and_var. This avoids precision issues with np.mean on long multidimensional float32 arrays, as discussed in numpy/numpy#9393.
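For reference, a small reproduction of that NumPy behavior (illustrative; the exact error depends on NumPy version and platform, since pairwise summation typically does not apply to this kind of strided axis reduction):

```python
import numpy as np

# True column mean is 0.1, but float32 accumulation over 10 million
# rows drifts noticeably; forcing float64 accumulation fixes it.
X = np.full((10_000_000, 2), 0.1, dtype=np.float32)
print(X.mean(axis=0))                    # float32 accumulator: off
print(X.mean(axis=0, dtype=np.float64))  # float64 accumulator: ~0.1
```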