[MRG] Add "grouped" option to Scaler classes #4963
Conversation
Force-pushed from 495fdd9 to 771701d.
Tagging @amueller because this PR is a product of the SciPy 2015 sprints. I think this is ready for review now.
sklearn/utils/sparsefuncs.py
Outdated
-    variances: float array with shape (n_features,)
-        Feature-wise variances
+    variances: float array with shape (n_features,) or scalar
+        Axis variances (or array mean if `axis` is None)
Axis variances (or array variance if `axis` is None)
Looks good!
@TomDLT, thank you for the careful read! I fixed the issues you pointed out. I left it in a separate commit for now so you could check only my fixes; if you're satisfied, I'll squash them into the previous commit to keep the history clean. On the …
...and another new commit to fix a nasty bug in the …
The tests fail under Windows with:
Apparently the availability of … http://stackoverflow.com/questions/9062562/what-is-the-internal-precision-of-numpy-float128. Please adjust the test to run the checks with …
More details in http://stackoverflow.com/a/17023995/163740. Based on this information, I think we should just not use …
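For context, a quick runtime check (my illustration, not part of the PR) shows why `np.float128`/`np.longdouble` behave differently across platforms:

```python
import numpy as np

# np.longdouble always exists, but its precision is platform-dependent:
# 80-bit extended precision on most Linux/x86 builds, while on Windows
# (MSVC) "long double" is just an alias for 64-bit double. np.float128,
# by contrast, may not be defined at all on some platforms.
eps_ld = np.finfo(np.longdouble).eps
eps_64 = np.finfo(np.float64).eps
print(eps_ld, eps_64)  # eps_ld is smaller than eps_64 only where extra precision exists
```

This is why a test that hard-codes `np.float128` can fail to even import on Windows.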
Force-pushed from 4ccd79d to 54b00e2.
As per that stackoverflow post, would np.longdouble be okay? I agree that float32 and float64 are the only ones that really matter. I just want to be certain that …
Force-pushed from 54b00e2 to 2314ba7.
I think we don't need to test that. This is a YAGNI. Let's keep the tests simple to maintain and focus on the useful things.
Force-pushed from 2314ba7 to 576c84e.
@ogrisel, done. The test now uses …
@@ -48,6 +50,63 @@ def csr_row_norms(X):
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
def sparse_mean_variance(X):
    """Compute the variance of a sparse matrix.
    Will copy data to an array of doubles if input data
Please insert a blank line after the first line of the docstring.
Force-pushed from 576c84e to aadec16.
Defaults to True, which is the previous behavior. Also allow the standalone scaling functions to take axis=None inputs in addition to 0 or 1. Includes some tweaks to the sparsefuncs helper functions to deal with `axis=None` inputs. Includes a new function `sparse_mean_variance` in "sparsefuncs_fast.pyx" to find the variance of a sparse array; that didn't seem to exist yet. This is one line in pure Python, but writing it in Cython is faster and avoids extra memory use (as long as the input array has dtype np.float32 or np.float64).
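The pure-Python one-liner mentioned in the commit message might look roughly like this (a hypothetical `sparse_mean_variance_py`, my sketch rather than the PR's Cython implementation):

```python
import numpy as np
import scipy.sparse as sp


def sparse_mean_variance_py(X):
    """Rough pure-Python equivalent of the proposed Cython routine.

    Implicit zeros count toward the statistics, so only the stored
    values need to be summed; var = E[x**2] - E[x]**2 over all
    n_samples * n_features entries.
    """
    n = X.shape[0] * X.shape[1]
    mean = X.data.sum() / n
    var = (X.data ** 2).sum() / n - mean ** 2
    return mean, var
```

The temporary array allocated by `X.data ** 2` is the kind of extra memory use the Cython version would avoid.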
Was inadvertently returning 0 instead of the correct answer for datatypes other than np.float32 and np.float64.
Every function in the module used `@cython.boundscheck(False)`, `@cython.wraparound(False)`, and `@cython.cdivision(True)`. Move those settings to module level to simplify the code.
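The consolidation described above can be done with Cython compiler-directive comments at the top of the .pyx file (a sketch; the exact placement in "sparsefuncs_fast.pyx" may differ):

```cython
# cython: boundscheck=False
# cython: wraparound=False
# cython: cdivision=True
```

These module-level directives apply to every function in the file, so the per-function decorators become unnecessary.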
Force-pushed from 6bbc583 to 1547b8e.
sklearn/utils/sparsefuncs.py
Outdated
     """
-    if axis not in (0, 1):
+    if axis not in (0, 1, None):
         raise ValueError(
             "Unknown axis value: %d. Use 0 for rows, or 1 for columns" % axis)
should we update this message?
I don't know if not handling … Otherwise it looks good to me.
This code will handle … I changed the error message you pointed out.
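The updated validation being discussed can be sketched as a standalone helper (a hypothetical `check_axis`, mirroring but not copied from the PR):

```python
def check_axis(axis):
    # axis=None now means "consider the whole array at once",
    # alongside the existing 0 (rows) and 1 (columns).
    if axis not in (0, 1, None):
        raise ValueError(
            "Unknown axis value: %r. Use 0 for rows, 1 for columns, "
            "or None for the full array" % (axis,))
```

Note the switch from `%d` to `%r` in the message, since `None` cannot be formatted with `%d`.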
Hi @stephen-hoover, sorry, apparently this pull request got lost... Are you still interested in finishing it? Thanks for your patience.
@cmarmo - Oh, wow, I'd forgotten about this PR. I will take a look at it and try to figure out how much work it would take to bring it up to date. I don't have a huge amount of free time anymore, so I couldn't redo it from scratch. Is this feature still useful? If you're asking, I'm guessing the answer is yes.
I don't think it would hurt, but I also don't think there's substantial demand for this change.
@stephen-hoover, thanks for your answer. I'm checking whether to relabel if necessary, in particular because your PR was labeled "Waiting for Reviewer". If you are no longer available to work on this, that's no problem.
Since the original issue is closed, I am closing this PR as well. Thank you for working on this PR.
As per discussion in issue #4892, add a "per_feature" option to the Scaler classes. It defaults to True, the previous behavior; when False, scaling is based on the entire data array at once instead of one feature at a time. Also allow "axis=None" in addition to axis=0 or axis=1 in the standalone scaling functions.
This PR includes tweaks to functions in the "sparsefuncs" module where doing so makes the axis=None behavior easier to implement.
TODO:
- preprocessing.data.scale
- preprocessing.data.maxabs_scale
- preprocessing.data.robust_scale
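To illustrate the proposed semantics (my sketch using plain NumPy, not the actual Scaler classes):

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [3.0, 300.0]])

# per_feature=True (existing behavior): each column is centered and
# scaled by its own mean and standard deviation.
per_feature = (X - X.mean(axis=0)) / X.std(axis=0)

# per_feature=False / axis=None (proposed): a single mean and scale
# computed over the whole array, preserving the relative magnitudes
# of the features.
grouped = (X - X.mean()) / X.std()
```

With `per_feature=True` both columns end up on the same scale; with the grouped variant, the second feature remains two orders of magnitude larger than the first.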