-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG+1-1] Refactoring and expanding sklearn.preprocessing scaling #2514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from all commits
Commits
Show all changes
7 commits
Select commit
Hold shift + click to select a range
2479fda
axis argument to sparsefuncs.mean_variance_axis.
06ecf48
BaseScaler abstraction.
a5d24a8
minmax_scale
8bdc83e
Refactor scaling tests
d3903c3
DOC MinMaxScaler / minmax_scale
28df99f
RobustScaler
b918e66
MAxAbsScaler
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -72,13 +72,14 @@ This class is hence suitable for use in the early steps of a | |
:class:`sklearn.pipeline.Pipeline`:: | ||
|
||
>>> scaler = preprocessing.StandardScaler().fit(X) | ||
>>> scaler | ||
StandardScaler(copy=True, with_mean=True, with_std=True) | ||
>>> scaler # doctest: +NORMALIZE_WHITESPACE | ||
StandardScaler(axis=0, copy=True, with_centering=True, with_mean=None, | ||
with_scaling=True, with_std=None) | ||
|
||
>>> scaler.mean_ # doctest: +ELLIPSIS | ||
>>> scaler.center_ # doctest: +ELLIPSIS | ||
array([ 1. ..., 0. ..., 0.33...]) | ||
|
||
>>> scaler.std_ # doctest: +ELLIPSIS | ||
>>> scaler.scale_ # doctest: +ELLIPSIS | ||
array([ 0.81..., 0.81..., 1.24...]) | ||
|
||
>>> scaler.transform(X) # doctest: +ELLIPSIS | ||
|
@@ -94,17 +95,19 @@ same way it did on the training set:: | |
array([[-2.44..., 1.22..., -0.26...]]) | ||
|
||
It is possible to disable either centering or scaling by either | ||
passing ``with_mean=False`` or ``with_std=False`` to the constructor | ||
passing ``with_centering=False`` or ``with_scaling=False`` to the constructor | ||
of :class:`StandardScaler`. | ||
|
||
|
||
Scaling features to a range | ||
--------------------------- | ||
An alternative standardization is scaling features to | ||
lie between a given minimum and maximum value, often between zero and one. | ||
This can be achieved using :class:`MinMaxScaler`. | ||
lie between a given minimum and maximum value, such as between zero and one, or | ||
so that the maximum value of each feature is scaled to unit size. | ||
This can be achieved using :class:`MinMaxScaler` or :class:`MaxAbsScaler`, | ||
respectively. | ||
|
||
The motivation to use this scaling include robustness to very small | ||
The motivation to use such a scaling include robustness to very small | ||
standard deviations of features and preserving zero entries in sparse data. | ||
|
||
Here is an example to scale a toy data matrix to the ``[0, 1]`` range:: | ||
|
@@ -132,11 +135,8 @@ applied to be consistent with the transformation performed on the train data:: | |
It is possible to introspect the scaler attributes to find about the exact | ||
nature of the transformation learned on the training data:: | ||
|
||
>>> min_max_scaler.scale_ # doctest: +ELLIPSIS | ||
array([ 0.5 , 0.5 , 0.33...]) | ||
|
||
>>> min_max_scaler.min_ # doctest: +ELLIPSIS | ||
array([ 0. , 0.5 , 0.33...]) | ||
>>> min_max_scaler.scale_ # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE | ||
array([ 2., 2., 3.]) | ||
|
||
If :class:`MinMaxScaler` is given an explicit ``feature_range=(min, max)`` the | ||
full formula is:: | ||
|
@@ -145,6 +145,76 @@ full formula is:: | |
|
||
X_scaled = X_std / (max - min) + min | ||
|
||
As with :func:`scale`, the ``preprocessing`` module further provides a | ||
convenience function function :func:`minmax_scale` if you don't want to use | ||
the `Transformer` API. | ||
|
||
:class:`MaxAbsScaler` works in a very similar fashion, but scales data so | ||
it lies within the range ``[-1, 1]``, and is meant for data | ||
that is already centered at zero. In particular, this scaler is very well | ||
suited for sparse data. | ||
|
||
Here is how to use the toy data from the previous example with this scaler:: | ||
|
||
>>> X_train = np.array([[ 1., -1., 2.], | ||
... [ 2., 0., 0.], | ||
... [ 0., 1., -1.]]) | ||
... | ||
>>> max_abs_scaler = preprocessing.MaxAbsScaler() | ||
>>> X_train_maxabs = max_abs_scaler.fit_transform(X_train) | ||
>>> X_train_maxabs #doctest +NORMALIZE_WHITESPACE^ | ||
array([[ 0.5, -1. , 1. ], | ||
[ 1. , 0. , 0. ], | ||
[ 0. , 1. , -0.5]]) | ||
>>> X_test = np.array([[ -3., -1., 4.]]) | ||
>>> X_test_maxabs = max_abs_scaler.transform(X_test) | ||
>>> X_test_maxabs # doctest: +NORMALIZE_WHITESPACE | ||
array([[-1.5, -1. , 2. ]]) | ||
>>> max_abs_scaler.scale_ # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE | ||
array([ 2., 1., 2.]) | ||
|
||
|
||
As with :func:`scale`, the ``preprocessing`` module further provides a | ||
convenience function function :func:`maxabs_scale` if you don't want to use | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same here |
||
the `Transformer` API. | ||
|
||
|
||
Scaling sparse data | ||
------------------- | ||
Centering sparse data would destroy the sparseness structure in the data, and | ||
thus rarely is a sensible thing to do. However, it can make sense to scale | ||
sparse inputs, especially if features are on different scales. | ||
|
||
:class:`MaxAbsScaler` and :func:`maxabs_scale` were specifically designed | ||
for scaling sparse data, and are the recommended way to go about this. | ||
However, :func:`scale` and :class:`StandardScaler` can accept ``scipy.sparse`` | ||
matrices as input, as long as ``with_centering=False`` is explicitly passed | ||
to the constructor. Otherwise a ``ValueError`` will be raised as | ||
silently centering would break the sparsity and would often crash the | ||
execution by allocating excessive amounts of memory unintentionally. | ||
:class:`RobustScaler` cannot be `fit`ted to sparse inputs, but you can use the | ||
`transform` method on sparse inputs. | ||
|
||
Note that the scalers accept both Compressed Sparse Rows and Compressed | ||
Sparse Columns format (see ``scipy.sparse.csr_matrix`` and | ||
``scipy.sparse.csc_matrix``). Any other sparse input will be **converted to | ||
the Compressed Sparse Rows representation**. To avoid unnecessary memory | ||
copies, it is recommended to choose the CSR or CSC representation upstream. | ||
|
||
Finally, if the centered data is expected to be small enough, explicitly | ||
converting the input to an array using the ``toarray`` method of sparse matrices | ||
is another option. | ||
|
||
|
||
Scaling data with outliers | ||
-------------------------- | ||
If your data contains many outliers, scaling using the mean and variance | ||
of the data does sometimes not work very well. In these cases, you can use | ||
:func:`robust_scale` and :class:`RobustScaler` as drop-in replacements | ||
instead, which use more robust estimates for the center and range of your | ||
data. | ||
|
||
|
||
.. topic:: References: | ||
|
||
Further discussion on the importance of centering and scaling data is | ||
|
@@ -161,26 +231,9 @@ full formula is:: | |
or :class:`sklearn.decomposition.RandomizedPCA` with ``whiten=True`` | ||
to further remove the linear correlation across features. | ||
|
||
.. topic:: Sparse input | ||
|
||
:func:`scale` and :class:`StandardScaler` accept ``scipy.sparse`` matrices | ||
as input **only when with_mean=False is explicitly passed to the | ||
constructor**. Otherwise a ``ValueError`` will be raised as | ||
silently centering would break the sparsity and would often crash the | ||
execution by allocating excessive amounts of memory unintentionally. | ||
|
||
If the centered data is expected to be small enough, explicitly convert | ||
the input to an array using the ``toarray`` method of sparse matrices | ||
instead. | ||
|
||
For sparse input the data is **converted to the Compressed Sparse Rows | ||
representation** (see ``scipy.sparse.csr_matrix``). | ||
To avoid unnecessary memory copies, it is recommended to choose the CSR | ||
representation upstream. | ||
|
||
.. topic:: Scaling target variables in regression | ||
|
||
:func:`scale` and :class:`StandardScaler` work out-of-the-box with 1d arrays. | ||
All scaling functions and classes work out-of-the-box with 1d arrays. | ||
This is very useful for scaling the target / response variables used | ||
for regression. | ||
|
||
|
@@ -526,6 +579,5 @@ similarly. | |
|
||
Note that if features have very different scaling or statistical | ||
properties, :class:`cluster.FeatureAgglomeration` maye not be able to | ||
capture the links between related features. Using a | ||
capture the links between related features. Using a | ||
:class:`preprocessing.StandardScaler` can be useful in these settings. | ||
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes me uncomfortable: it seems to suggest there is no benefit in using the Transformer API. Perhaps something more circumspect like "Where appropriate, the convenience function :func:
minmax_scale
can be used instead of a transformer."In any case, you say "function function"
You also say