
[MRG] Add "grouped" option to Scaler classes #4963


Closed

Conversation

stephen-hoover
Contributor

As per the discussion in issue #4892, add a "per_feature" option to the Scaler classes. The option defaults to True, which preserves the previous behavior; when False, scaling is computed from the entire data array at once instead of one feature at a time. Also allow "axis=None", in addition to axis=0 or axis=1, in the standalone scaling functions.

This PR also includes tweaks to functions in the "sparsefuncs" module where they make the axis=None behavior easier to implement.
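
For illustration, here is a minimal sketch of the intended usage. The per_feature keyword shown below is the option proposed in this PR, not an existing StandardScaler parameter:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Current behavior: each column is centered and scaled independently.
X_per_feature = StandardScaler().fit_transform(X)

# Proposed behavior (per_feature=False): a single mean and standard deviation
# are computed over the whole array, preserving the relative magnitudes of
# the features.
# X_grouped = StandardScaler(per_feature=False).fit_transform(X)
# This would be equivalent to:
X_grouped = (X - X.mean()) / X.std()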

TODO:

  • Add "per_feature" option to RobustScaler
  • Add "axis=None" option to preprocessing.data.scale
  • Add "axis=None" option to preprocessing.data.maxabs_scale
  • Add "axis=None" option to preprocessing.data.robust_scale

@stephen-hoover force-pushed the grouped-feature-scaling branch 4 times, most recently from 495fdd9 to 771701d on July 12, 2015, 21:55
@stephen-hoover changed the title from [WIP] Add "grouped" option to Scalar classes to [MRG] Add "grouped" option to Scalar classes on Jul 12, 2015
@stephen-hoover
Contributor Author

Tagging @amueller because this PR is a product of the SciPy 2015 sprints. I think this is ready for review now.

-    variances: float array with shape (n_features,)
-        Feature-wise variances
+    variances: float array with shape (n_features,) or scalar
+        Axis variances (or array mean if `axis` is None)
Member

Axis variances (or array variance if axis is None)

@TomDLT
Member

TomDLT commented Jul 21, 2015

Looks good!

@ogrisel changed the title from [MRG] Add "grouped" option to Scalar classes to [MRG] Add "grouped" option to Scaler classes on Jul 21, 2015
@stephen-hoover
Contributor Author

@TomDLT, thank you for the careful read! I fixed the issues you pointed out. I left the fixes in a separate commit for now so you can review just those changes; if you're satisfied, I'll squash them into the previous commit to keep the history clean.

On test_mean_variance_axisnone, I should have read a bit more carefully -- I copy-pasted directly from the test_mean_variance_axis0 and axis1 functions, and those errors were already present there. Rather than have the same tests copy-pasted six times, I combined all three axis tests into one function.

@stephen-hoover
Contributor Author

...and another new commit to fix a nasty bug in the sparsefuncs_fast.sparse_mean_variance function I wrote. It includes a test case that failed in the presence of the bug.

@ogrisel
Member

ogrisel commented Jul 22, 2015

The tests fail under Windows with:

======================================================================
ERROR: sklearn.utils.tests.test_sparsefuncs.test_mean_variance_all_axes
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python34-x64\lib\site-packages\nose\case.py", line 198, in runTest
    self.test(*self.arg)
  File "C:\Python34-x64\lib\site-packages\sklearn\utils\tests\test_sparsefuncs.py", line 47, in test_mean_variance_all_axes
    for dtype in [np.float32, np.float64, np.float128]:
AttributeError: 'module' object has no attribute 'float128'

Apparently the availability of numpy.float128 and its actual precision are platform dependent:

http://stackoverflow.com/questions/9062562/what-is-the-internal-precision-of-numpy-float128

Please adjust the test to run the checks with np.float128 only when hasattr(np, 'float128').
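
A minimal sketch of that guard (illustrative only, not the actual test code):

import numpy as np

dtypes = [np.float32, np.float64]
if hasattr(np, 'float128'):
    # Only exercise extended precision on platforms that provide it.
    dtypes.append(np.float128)

for dtype in dtypes:
    pass  # run the mean/variance checks with this dtype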

@ogrisel
Member

ogrisel commented Jul 22, 2015

More details in http://stackoverflow.com/a/17023995/163740. Based on this information, I think we should just not use np.float128 and instead stick to np.float32 and np.float64. I don't think we need more than 64-bit precision for floating-point values in a machine learning context anyway.

@stephen-hoover force-pushed the grouped-feature-scaling branch from 4ccd79d to 54b00e2 on July 22, 2015, 14:45
@stephen-hoover
Contributor Author

As per that stackoverflow post, would np.longdouble be okay? I agree that float32 and float64 are the only ones that really matter. I just want to be certain that sparsefuncs_fast.sparse_mean_variance works properly for all input types. It has different code for three possible input types: np.float32, np.float64, and everything else. I tried testing with float16 first, but that would require adjusting the allowed tolerance on the assert_array_almost_equal checks. Using something with higher precision is a simpler solution, if it works.
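
As a rough illustration of that dispatch, here is a hypothetical pure-Python equivalent (not the PR's Cython code): any dtype other than np.float32 or np.float64 falls back to a float64 copy before the mean and variance are computed.

import numpy as np
import scipy.sparse as sp

def sparse_mean_variance_py(X):
    if X.dtype not in (np.float32, np.float64):
        X = X.astype(np.float64)  # fallback path: copy to double precision
    mean = X.mean()  # mean over all entries, including the implicit zeros
    var = X.multiply(X).mean() - mean ** 2  # Var[x] = E[x^2] - E[x]^2
    return mean, var

X = sp.csr_matrix(np.array([[0.0, 1.0], [2.0, 3.0]]))
print(sparse_mean_variance_py(X))  # (1.5, 1.25)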

@stephen-hoover force-pushed the grouped-feature-scaling branch from 54b00e2 to 2314ba7 on July 22, 2015, 18:29
@ogrisel
Member

ogrisel commented Jul 23, 2015

As per that stackoverflow post, would np.longdouble be okay?

I think we don't need to test that. This is a YAGNI. Let's keep the tests simple to maintain and focus on the useful things.

@ogrisel
Member

ogrisel commented Jul 23, 2015

It has different code for three possible input types: np.float32, np.float64, and everything else. I tried testing with float16 first, but that would require adjusting the allowed tolerance on the assert_array_almost_equal checks. Using something with higher precision is a simpler solution, if it works.

np.float16 might be useful for some problems with tight memory constraints. +1 for adapting the tests for this. Otherwise you can also try with a complex dtype, although I don't have a use case for that in a machine learning context.

@stephen-hoover force-pushed the grouped-feature-scaling branch from 2314ba7 to 576c84e on July 23, 2015, 14:52
@stephen-hoover
Contributor Author

@ogrisel, done. The test now uses np.float16 in addition to np.float32 and np.float64. The lil_matrix had trouble with np.float16, so I switched to coo_matrix for the invalid-sparse-style test.
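
For reference, a small sketch of the construction involved (illustrative values only):

import numpy as np
import scipy.sparse as sp

data = np.array([1.0, 2.0], dtype=np.float16)
rows = np.array([0, 1])
cols = np.array([0, 1])
# coo_matrix accepts the float16 data directly; lil_matrix was the
# problematic format mentioned above.
X = sp.coo_matrix((data, (rows, cols)), shape=(2, 2))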

@@ -48,6 +50,63 @@ def csr_row_norms(X):
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
def sparse_mean_variance(X):
    """Compute the variance of a sparse matrix.
    Will copy data to an array of doubles if input data
Member

Please insert a blank line after the first line of the docstring.

@stephen-hoover force-pushed the grouped-feature-scaling branch from 576c84e to aadec16 on July 25, 2015, 12:45
Defaults to True, which is the previous behavior. Also allow the standalone scaling functions to take "axis=None" inputs in addition to 0 or 1.
Includes some tweaks to the sparsefuncs helper functions to deal with `axis=None` inputs.
Includes a new function `sparse_mean_variance` in "sparsefuncs_fast.pyx" to find the variance of a sparse array; that didn't seem to exist yet. This is one line in pure Python, but writing it in Cython is faster and avoids extra memory use (as long as the input array has dtype np.float32 or np.float64).
Was inadvertently returning 0 instead of the correct answer for data types other than np.float32 and np.float64.
Every function in the module used `@cython.boundscheck(False)`, `@cython.wraparound(False)`, and `@cython.cdivision(True)`. Move those settings to module level to simplify the code.
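
For reference, module-level Cython directives are written as header comments at the top of the .pyx file; a sketch of the form used (the exact header in sparsefuncs_fast.pyx may differ):

# cython: boundscheck=False
# cython: wraparound=False
# cython: cdivision=True
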
@stephen-hoover
Contributor Author

Pinging the people who reviewed this PR earlier: @TomDLT, @ogrisel. I think I've addressed all comments. I rebased on the current master, and tests are passing. There are two commits in here that I'll squash after I get a sign-off.


"""
-    if axis not in (0, 1):
+    if axis not in (0, 1, None):
         raise ValueError(
             "Unknown axis value: %d. Use 0 for rows, or 1 for columns" % axis)
Member

should we update this message?
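
For illustration only (this is not the wording adopted in the PR), an updated message might read as follows; note %s rather than %d, so that non-integer values format cleanly:

raise ValueError(
    "Unknown axis value: %s. Use 0 for rows, 1 for columns, "
    "or None for the whole array" % axis)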

@TomDLT
Member

TomDLT commented Sep 14, 2015

I don't know if not handling np.float16 is a problem or not.

Otherwise it looks good to me.

@stephen-hoover
Contributor Author

This code will handle np.float16; it'll just have to copy the array contents to an np.float64 array first. I don't think there's a way around a copy: np.float16 is non-standard, and Cython doesn't handle it.

I changed the error message you pointed out.

@cmarmo
Contributor

cmarmo commented Sep 29, 2020

Hi @stephen-hoover, sorry, apparently this pull request got lost... Are you still interested in finishing it? Thanks for your patience.

@stephen-hoover
Contributor Author

@cmarmo - Oh, wow, I'd forgotten about this PR. I will take a look at it and try to figure out how much work it would take to bring it up to date. I don't have a huge amount of free time anymore, so I couldn't redo it from scratch. Is this feature still useful? If you're asking, I'm guessing the answer is yes.

@jnothman
Member

jnothman commented Oct 8, 2020

I don't think it would hurt, but I also don't think there's substantial demand for this change.

@cmarmo
Contributor

cmarmo commented Oct 8, 2020

@stephen-hoover, thanks for your answer. I'm checking in order to relabel if necessary, in particular because your PR was marked "Waiting for Reviewer". If you are no longer available to work on this, that's no problem.

@cmarmo added the Low Priority label Oct 22, 2020
Base automatically changed from master to main January 22, 2021 10:48
@thomasjpfan
Member

Since the original issue is closed, I am closing this PR as well. Thank you for working on this PR.
