[MRG] Add "grouped" option to Scaler classes #4963
Conversation
Force-pushed from 495fdd9 to 771701d.
Tagging @amueller because this PR is a product of the SciPy 2015 sprints. I think this is ready for review now.
sklearn/utils/sparsefuncs.py
Outdated
-    variances: float array with shape (n_features,)
-        Feature-wise variances
+    variances: float array with shape (n_features,) or scalar
+        Axis variances (or array mean if `axis` is None)
Axis variances (or array variance if `axis` is None)
Looks good!
@TomDLT, thank you for the careful read! I fixed the issues you pointed out. I left it in a separate commit for now so you could check only my fixes; if you're satisfied, I'll squash them into the previous commit to keep the history clean. On the …
...and another new commit to fix a nasty bug in the …
The tests fail under Windows with:
Apparently the availability of … http://stackoverflow.com/questions/9062562/what-is-the-internal-precision-of-numpy-float128. Please adjust the test to run the checks with …
More details in http://stackoverflow.com/a/17023995/163740. Based on this information, I think we should just not use …
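For context, a quick runtime check (my illustration, not part of the PR) shows why `np.float128`/`np.longdouble` behave differently across platforms:

```python
import numpy as np

# np.longdouble always exists, but its precision is platform-dependent:
# 80-bit extended precision on most Linux/x86 builds, while on Windows
# (MSVC) "long double" is just an alias for 64-bit double. np.float128,
# by contrast, may not be defined at all on some platforms.
eps_ld = np.finfo(np.longdouble).eps
eps_64 = np.finfo(np.float64).eps
print(eps_ld, eps_64)  # eps_ld is smaller than eps_64 only where extra precision exists
```

This is why a test that hard-codes `np.float128` can fail to even import on Windows.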
Force-pushed from 4ccd79d to 54b00e2.
As per that stackoverflow post, would np.longdouble be okay? I agree that float32 and float64 are the only ones that really matter. I just want to be certain that …
Force-pushed from 54b00e2 to 2314ba7.
I think we don't need to test that. This is a YAGNI. Let's keep the tests simple to maintain and focus on the useful things.
Force-pushed from 2314ba7 to 576c84e.
@ogrisel, done. The test now uses …
@@ -48,6 +50,63 @@ def csr_row_norms(X):
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
def sparse_mean_variance(X):
    """Compute the variance of a sparse matrix.
    Will copy data to an array of doubles if input data
Please insert a blank line after the first line of the docstring.
Force-pushed from 576c84e to aadec16.
Defaults to True, which is the previous behavior. Also allow the standalone scaling functions to take axis=None inputs in addition to 0 or 1. Includes some tweaks to the sparsefuncs helper functions to deal with `axis=None` inputs. Includes a new function `sparse_mean_variance` in "sparsefuncs_fast.pyx" to find the variance of a sparse array; that didn't seem to exist yet. This is one line in pure Python, but writing it in Cython is faster and avoids extra memory use (as long as the input array has dtype np.float32 or np.float64).
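The pure-Python one-liner mentioned in the commit message might look roughly like this (a hypothetical `sparse_mean_variance_py`, my sketch rather than the PR's Cython implementation):

```python
import numpy as np
import scipy.sparse as sp


def sparse_mean_variance_py(X):
    """Rough pure-Python equivalent of the proposed Cython routine.

    Implicit zeros count toward the statistics, so only the stored
    values need to be summed; var = E[x**2] - E[x]**2 over all
    n_samples * n_features entries.
    """
    n = X.shape[0] * X.shape[1]
    mean = X.data.sum() / n
    var = (X.data ** 2).sum() / n - mean ** 2
    return mean, var
```

The temporary array allocated by `X.data ** 2` is the kind of extra memory use the Cython version would avoid.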
Was inadvertently returning 0 instead of the correct answer for datatypes other than np.float32 and np.float64.
Every function in the module used `@cython.boundscheck(False)`, `@cython.wraparound(False)`, and `@cython.cdivision(True)`. Move those settings to module level to simplify the code.
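The consolidation described above can be done with Cython compiler-directive comments at the top of the .pyx file (a sketch; the exact placement in "sparsefuncs_fast.pyx" may differ):

```cython
# cython: boundscheck=False
# cython: wraparound=False
# cython: cdivision=True
```

These module-level directives apply to every function in the file, so the per-function decorators become unnecessary.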
Force-pushed from 6bbc583 to 1547b8e.
sklearn/utils/sparsefuncs.py
Outdated
     """
-    if axis not in (0, 1):
+    if axis not in (0, 1, None):
         raise ValueError(
             "Unknown axis value: %d. Use 0 for rows, or 1 for columns" % axis)
should we update this message?
I don't know if not handling … Otherwise it looks good to me.
This code will handle … I changed the error message you pointed out.
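The updated validation being discussed can be sketched as a standalone helper (a hypothetical `check_axis`, mirroring but not copied from the PR):

```python
def check_axis(axis):
    # axis=None now means "consider the whole array at once",
    # alongside the existing 0 (rows) and 1 (columns).
    if axis not in (0, 1, None):
        raise ValueError(
            "Unknown axis value: %r. Use 0 for rows, 1 for columns, "
            "or None for the full array" % (axis,))
```

Note the switch from `%d` to `%r` in the message, since `None` cannot be formatted with `%d`.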
Hi @stephen-hoover, sorry, apparently this pull request got lost... Are you still interested in finishing it? Thanks for your patience.
@cmarmo - Oh, wow, I'd forgotten about this PR. I will take a look at it and try to figure out how much work it would take to bring it up to date. I don't have a huge amount of free time anymore, so I couldn't redo it from scratch. Is this feature still useful? If you're asking, I'm guessing the answer is yes.
I don't think it would hurt, but I also don't think there's substantial demand for this change.
@stephen-hoover, thanks for your answer. I'm checking whether to relabel if necessary, in particular because your PR was labeled "Waiting for Reviewer". If you are no longer available to work on this, that's no problem.
Since the original issue is closed, I am closing this PR as well. Thank you for working on this PR.
As per discussion in issue #4892, add a "per_feature" option to the Scaler classes. It defaults to True, the previous behavior; when False, scaling is based on the entire data array at once instead of one feature at a time. Also allow "axis=None" in addition to axis=0 or axis=1 in the standalone scaling functions.
This PR includes tweaks to functions in the "sparsefuncs" module where doing so makes the axis=None behavior easier to implement.
TODO:
- preprocessing.data.scale
- preprocessing.data.maxabs_scale
- preprocessing.data.robust_scale
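To illustrate the proposed semantics (my sketch using plain NumPy, not the actual Scaler classes):

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [3.0, 300.0]])

# per_feature=True (existing behavior): each column is centered and
# scaled by its own mean and standard deviation.
per_feature = (X - X.mean(axis=0)) / X.std(axis=0)

# per_feature=False / axis=None (proposed): a single mean and scale
# computed over the whole array, preserving the relative magnitudes
# of the features.
grouped = (X - X.mean()) / X.std()
```

With `per_feature=True` both columns end up on the same scale; with the grouped variant, the second feature remains two orders of magnitude larger than the first.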