[MRG+1] Add 'axis' argument to sparsefuncs.mean_variance_axis #3622
8036a2e to f09a94f
if axis < 0:
    axis += 2
if (axis != 0) and (axis != 1):
    raise ValueError("Invalid axis, use 0 for rows, or 1 for columns")
Can you tell the user which axis was given?
This is useful for introspecting code and debugging.
done
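A minimal sketch of what the request above could look like, assuming the check lives in a small helper. The name `check_axis` and the exact message wording are hypothetical and are not taken from the patch; the negative-axis handling mirrors the diff as it stood at this point in the review.

```python
def check_axis(axis):
    """Validate a 2-D axis argument (hypothetical helper, not from the patch)."""
    given = axis  # remember the caller's value so the error can report it
    if axis < 0:
        axis += 2  # map -2 -> 0 and -1 -> 1, as in the diff under review
    if (axis != 0) and (axis != 1):
        raise ValueError(
            "Invalid axis %r, use 0 for rows, or 1 for columns" % (given,)
        )
    return axis
```

Reporting the value the caller passed, rather than the shifted one, keeps the message useful when the bad value was negative.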
@@ -61,6 +61,9 @@ def mean_variance_axis0(X):
    X: CSR or CSC sparse matrix, shape (n_samples, n_features)
        Input data.

    axis: int (either 0 or 1)
        Axis along which the axis should be computed.
Apparently, you also accept -1 and -2.
True, for consistency with other functions in sklearn (and scipy in general) that handle the axis argument this way (e.g. count_nonzero in the same file), but those functions don't document that usage either. I assumed this is an sklearn convention.
e.g. see also https://github.com/scipy/scipy/blob/master/scipy/sparse/compressed.py which uses the same convention throughout, but never documents it.
Whereas numpy.matrix.std has a docstring that says: "Refer to `numpy.std` for full documentation."
Grepping through the numpy and scipy codebases, it seems the most common way is to describe this as "axis : int" without specifying which values are allowed (which makes sense for numpy, given that an ndarray can have any number of axes), while the scipy.sparse module explicitly lists 0 and 1 as valid arguments (never -1 and -2, although the functions in question do accept those values as well). Personally I think the way I documented it makes sense, as it's consistent with scipy.sparse.
count_nonzero is a backport from NumPy. We don't generally accept funny axes, since data is assumed to be 2-d almost everywhere.
So what do you suggest would be the right thing to do? Remove -2/-1 as accepted values?
Yes, I'd get rid of those. They're unlikely to be more useful than confusing.
Perhaps more to the point, unlike scipy.sparse, utils here are not public.
On 3 September 2014 04:21, Lars Buitinck wrote:
In sklearn/utils/sparsefuncs.py:
@@ -61,6 +61,9 @@ def mean_variance_axis0(X):
X: CSR or CSC sparse matrix, shape (n_samples, n_features)
Input data.
- axis: int (either 0 or 1)
Axis along which the axis should be computed.
Yes, I'd get rid of those. They're unlikely to be more useful than confusing.
99c8f48 to 50e20a0
I've removed the support for
Travis CI error was due to out-of-error on Python 3.4, I don't think this is related to my patch
+1 for merge. Also, thank you for taking the time to split up #2514. I would still like to see (most of) that merged.
@@ -71,10 +74,20 @@ def mean_variance_axis0(X):
        Feature-wise variances

    """
    if axis != 0 and axis != 1:
Style / nitpick: this could be written as `if axis not in (0, 1):`
But ok, it's not really important.
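As a quick illustration of the nitpick above, the two spellings of the check should agree on every input, since both rely on `==` comparisons. The helper names below are hypothetical and exist only for this comparison.

```python
def is_invalid_explicit(axis):
    # Form used in the diff under review.
    return axis != 0 and axis != 1


def is_invalid_membership(axis):
    # Form suggested in the review comment.
    return axis not in (0, 1)


# A few candidate values, including ones a caller might pass by mistake.
candidates = [-2, -1, 0, 1, 2, 0.0, 1.0, None, "0"]
agree = all(
    is_invalid_explicit(a) == is_invalid_membership(a) for a in candidates
)
```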
looks good to merge
I'll fix arjoly's last comment, that should also kick off Travis CI again, just to be sure
This PR adds an 'axis' argument to sparsefuncs.mean_variance_axis, making it easier to calculate the columnwise mean/variance of sparse matrices.
While switching the codebase over to the new functionality, I noticed that VarianceThreshold needlessly converted CSC matrices to CSR. This PR also includes a commit to fix this, as the change fits naturally with the other changes.
(If the immediate need for the new functionality is not obvious: this PR is the first in a series that tries to split up #2514 into a few smaller, easy-to-digest PRs. I plan to submit other PRs that will make use of this one.)
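To make the role of the axis argument concrete, here is a pure-Python sketch of computing the mean and (population) variance of a CSR matrix along either axis, working directly on the raw `data`, `indices` and `indptr` arrays. This is not the implementation in sklearn/utils/sparsefuncs.py; the function name `csr_mean_variance` is hypothetical, and the simple E[x^2] - E[x]^2 formula is chosen for brevity rather than numerical robustness.

```python
def csr_mean_variance(data, indices, indptr, shape, axis):
    """Mean and population variance along `axis` of a CSR matrix.

    axis=0 reduces over rows (one result per column / feature);
    axis=1 reduces over columns (one result per row / sample).
    Hypothetical helper for illustration only.
    """
    if axis not in (0, 1):
        raise ValueError("Invalid axis %r, use 0 or 1" % (axis,))
    n_rows, n_cols = shape
    n_out = n_cols if axis == 0 else n_rows      # number of results
    n_reduced = n_rows if axis == 0 else n_cols  # entries averaged per result

    sums = [0.0] * n_out
    sq_sums = [0.0] * n_out
    for row in range(n_rows):
        # Stored (non-zero) entries of this row; implicit zeros add nothing.
        for k in range(indptr[row], indptr[row + 1]):
            target = indices[k] if axis == 0 else row
            sums[target] += data[k]
            sq_sums[target] += data[k] ** 2

    # Dividing by n_reduced (not the stored count) accounts for implicit zeros.
    means = [s / n_reduced for s in sums]
    variances = [sq / n_reduced - m * m for sq, m in zip(sq_sums, means)]
    return means, variances


# The 2 x 3 matrix [[1, 0, 2], [0, 0, 3]] in CSR form.
data, indices, indptr = [1.0, 2.0, 3.0], [0, 2, 2], [0, 2, 3]
col_means, col_vars = csr_mean_variance(data, indices, indptr, (2, 3), axis=0)
row_means, row_vars = csr_mean_variance(data, indices, indptr, (2, 3), axis=1)
```

Only the stored entries are visited, so neither axis requires densifying the matrix or converting it to another sparse format.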