Commit ab734b7

ENH add MaxAbsScaler

1 parent bdef419

File tree: 5 files changed (+265, -52 lines)

doc/modules/preprocessing.rst: 60 additions & 19 deletions
@@ -102,8 +102,10 @@ Scaling features to a range
 ---------------------------
 
 An alternative standardization is scaling features to
-lie between a given minimum and maximum value, often between zero and one.
-This can be achieved using :class:`MinMaxScaler`.
+lie between a given minimum and maximum value, often between zero and one,
+or so that the maximum absolute value of each feature is scaled to unit size.
+This can be achieved using :class:`MinMaxScaler` or :class:`MaxAbsScaler`,
+respectively.
 
 The motivation to use this scaling includes robustness to very small
 standard deviations of features and preserving zero entries in sparse data.
@@ -146,6 +148,62 @@ full formula is::
 
     X_scaled = X_std * (max - min) + min
 
+:class:`MaxAbsScaler` works in a very similar fashion, but scales the data
+so that the training set lies within the range ``[-1, 1]``, by dividing each
+feature by its maximum absolute value. It is meant for data
+that is already centered at zero, or for sparse data.
+
+Here is how to use the toy data from the previous example with this scaler::
+
+  >>> X_train = np.array([[ 1., -1.,  2.],
+  ...                     [ 2.,  0.,  0.],
+  ...                     [ 0.,  1., -1.]])
+  ...
+  >>> max_abs_scaler = preprocessing.MaxAbsScaler()
+  >>> X_train_maxabs = max_abs_scaler.fit_transform(X_train)
+  >>> X_train_maxabs                # doctest: +NORMALIZE_WHITESPACE
+  array([[ 0.5, -1. ,  1. ],
+         [ 1. ,  0. ,  0. ],
+         [ 0. ,  1. , -0.5]])
+  >>> X_test = np.array([[ -3., -1.,  4.]])
+  >>> X_test_maxabs = max_abs_scaler.transform(X_test)
+  >>> X_test_maxabs                 # doctest: +NORMALIZE_WHITESPACE
+  array([[-1.5, -1. ,  2. ]])
+  >>> max_abs_scaler.scale_         # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
+  array([ 2.,  1.,  2.])
+
+
+As with :func:`scale`, the module further provides a
+convenience function :func:`maxabs_scale` if you don't want to
+create an object.
+
+
+Scaling sparse data
+-------------------
+Centering sparse data would destroy the sparseness structure in the data, and
+thus rarely is a sensible thing to do. However, it can make sense to scale
+sparse inputs, especially if features are on different scales.
+
+:class:`MaxAbsScaler` and :func:`maxabs_scale` were specifically designed
+for scaling sparse data, and are the recommended way to go about this.
+However, :func:`scale` and :class:`StandardScaler` can accept ``scipy.sparse``
+matrices as input, as long as ``with_mean=False`` is explicitly passed
+to the constructor. Otherwise a ``ValueError`` will be raised, as
+silently centering would break the sparsity and would often crash the
+execution by allocating excessive amounts of memory unintentionally.
+:class:`RobustScaler` cannot be fitted to sparse inputs, but you can use
+the ``transform`` method on sparse inputs.
+
+Note that the scalers accept both Compressed Sparse Rows and Compressed
+Sparse Columns formats (see ``scipy.sparse.csr_matrix`` and
+``scipy.sparse.csc_matrix``). Any other sparse input will be **converted to
+the Compressed Sparse Rows representation**. To avoid unnecessary memory
+copies, it is recommended to choose the CSR or CSC representation upstream.
+
+Finally, if the centered data is expected to be small enough, explicitly
+converting the input to an array using the ``toarray`` method of sparse
+matrices is another option.
+
 
 Scaling data with outliers
 --------------------------
@@ -173,23 +231,6 @@ data.
 or :class:`sklearn.decomposition.RandomizedPCA` with ``whiten=True``
 to further remove the linear correlation across features.
 
-.. topic:: Sparse input
-
-  :func:`scale` and :class:`StandardScaler` accept ``scipy.sparse`` matrices
-  as input **only when with_mean=False is explicitly passed to the
-  constructor**. Otherwise a ``ValueError`` will be raised as
-  silently centering would break the sparsity and would often crash the
-  execution by allocating excessive amounts of memory unintentionally.
-
-  If the centered data is expected to be small enough, explicitly convert
-  the input to an array using the ``toarray`` method of sparse matrices
-  instead.
-
-  For sparse input the data is **converted to the Compressed Sparse Rows
-  representation** (see ``scipy.sparse.csr_matrix``).
-  To avoid unnecessary memory copies, it is recommended to choose the CSR
-  representation upstream.
-
 .. topic:: Scaling target variables in regression
 
     :func:`scale` and :class:`StandardScaler` work out-of-the-box with 1d arrays.
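A quick sketch of the documented sparse behaviour (not part of the commit; it assumes numpy, scipy, and a scikit-learn build that includes this change)::

    import numpy as np
    import scipy.sparse as sp
    from sklearn.preprocessing import MaxAbsScaler, StandardScaler

    # Sparse input with an all-zero column; zeros must survive scaling.
    X = sp.csr_matrix([[1., 0., -2.],
                       [2., 0.,  0.],
                       [0., 0.,  4.]])

    # MaxAbsScaler only divides, so the sparsity structure is preserved.
    X_maxabs = MaxAbsScaler().fit_transform(X)
    print(X_maxabs.toarray())

    # StandardScaler accepts sparse input only without centering ...
    X_std = StandardScaler(with_mean=False).fit_transform(X)

    # ... and refuses to center it silently.
    try:
        StandardScaler(with_mean=True).fit(X)
    except ValueError as exc:
        print(exc)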

doc/whats_new.rst: 5 additions & 0 deletions
@@ -21,6 +21,11 @@ New features
      alternative to :class:`preprocessing.StandardScaler` for feature-wise
      centering and range normalization that is robust to outliers. By `Thomas Unterthiner`_.
 
+   - The new class :class:`preprocessing.MaxAbsScaler` provides an
+     alternative to :class:`preprocessing.MinMaxScaler` for feature-wise
+     range normalization when the data is already centered or sparse.
+     By `Thomas Unterthiner`_.
+
 Enhancements
 ............
 

sklearn/preprocessing/__init__.py: 4 additions & 0 deletions
@@ -6,6 +6,7 @@
 from .data import Binarizer
 from .data import KernelCenterer
 from .data import MinMaxScaler
+from .data import MaxAbsScaler
 from .data import Normalizer
 from .data import RobustScaler
 from .data import StandardScaler
@@ -14,6 +15,7 @@
 from .data import normalize
 from .data import scale
 from .data import robust_scale
+from .data import maxabs_scale
 from .data import OneHotEncoder
 
 from .data import PolynomialFeatures
@@ -33,6 +35,7 @@
     'LabelEncoder',
     'MultiLabelBinarizer',
     'MinMaxScaler',
+    'MaxAbsScaler',
     'Normalizer',
     'OneHotEncoder',
     'RobustScaler',
@@ -43,5 +46,6 @@
     'normalize',
     'scale',
     'robust_scale',
+    'maxabs_scale',
     'label_binarize',
 ]
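With these exports in place, both new names can be imported straight from ``sklearn.preprocessing``; a minimal check, assuming a build that includes this commit::

    from sklearn.preprocessing import MaxAbsScaler, maxabs_scale

    # Both names resolve to the implementations added in data.py below.
    print(MaxAbsScaler())           # repr shows the single `copy` parameter
    print(maxabs_scale.__module__)  # sklearn.preprocessing.data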

sklearn/preprocessing/data.py: 138 additions & 33 deletions
@@ -32,6 +32,7 @@
     'Binarizer',
     'KernelCenterer',
     'MinMaxScaler',
+    'MaxAbsScaler',
     'Normalizer',
     'OneHotEncoder',
     'RobustScaler',
@@ -41,6 +42,7 @@
     'normalize',
     'scale',
     'robust_scale',
+    'maxabs_scale',
 ]
 
 
@@ -59,16 +61,28 @@ def _mean_and_std(X, axis=0, with_mean=True, with_std=True):
 
     if with_std:
         std_ = Xr.std(axis=0)
-        if isinstance(std_, np.ndarray):
-            std_[std_ == 0.] = 1.0
-        elif std_ == 0.:
-            std_ = 1.
+        std_ = _handle_zeros_in_scale(std_)
     else:
         std_ = None
 
     return mean_, std_
 
 
+def _handle_zeros_in_scale(scale):
+    ''' Makes sure that whenever scale is zero, we handle it correctly.
+
+    This happens in most scalers when we have constant features.'''
+
+    # if we are fitting on 1D arrays, scale might be a scalar
+    if np.isscalar(scale):
+        if scale == 0:
+            scale = 1.
+    elif isinstance(scale, np.ndarray):
+        scale[scale == 0.0] = 1.0
+        scale[~np.isfinite(scale)] = 1.0
+    return scale
+
+
 def scale(X, axis=0, with_mean=True, with_std=True, copy=True):
     """Standardize a dataset along any axis
 
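The helper's contract is easiest to see on a concrete vector. Since the function itself is private, this sketch restates its array branch inline::

    import numpy as np

    # Zero or non-finite scales are replaced by 1.0, so that a later
    # division by the scale leaves constant features untouched.
    scale = np.array([2.0, 0.0, np.inf, 1.0])
    scale[scale == 0.0] = 1.0
    scale[~np.isfinite(scale)] = 1.0
    print(scale)  # [ 2.  1.  1.  1.]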
@@ -132,7 +146,7 @@ def scale(X, axis=0, with_mean=True, with_std=True, copy=True):
         if copy:
             X = X.copy()
         _, var = mean_variance_axis(X, axis=0)
-        var[var == 0.0] = 1.0
+        var = _handle_zeros_in_scale(var)
         inplace_column_scale(X, 1 / np.sqrt(var))
     else:
         X = np.asarray(X)
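The function form takes the same guarded path for sparse input; a sketch, assuming scipy::

    import scipy.sparse as sp
    from sklearn.preprocessing import scale

    # The second (all-zero) column has zero variance; the guard replaces
    # it with 1.0, so no division by zero occurs and the zeros survive.
    X = sp.csr_matrix([[1., 0.], [3., 0.], [2., 0.]])
    X_scaled = scale(X, with_mean=False)
    print(X_scaled.toarray())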
@@ -233,11 +247,7 @@ def fit(self, X, y=None):
                              " than maximum. Got %s." % str(feature_range))
         data_min = np.min(X, axis=0)
         data_range = np.max(X, axis=0) - data_min
-        # Do not scale constant features
-        if isinstance(data_range, np.ndarray):
-            data_range[data_range == 0.0] = 1.0
-        elif data_range == 0.:
-            data_range = 1.
+        data_range = _handle_zeros_in_scale(data_range)
         self.scale_ = (feature_range[1] - feature_range[0]) / data_range
         self.min_ = feature_range[0] - data_min * self.scale_
         self.data_range = data_range
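With the shared helper in place, a constant feature no longer triggers a division by zero in :class:`MinMaxScaler`; a small sketch, assuming numpy::

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # The second feature is constant, so its data_range would be 0.
    X = np.array([[1., 5.],
                  [2., 5.],
                  [4., 5.]])
    print(MinMaxScaler().fit_transform(X))
    # The constant column comes out as all zeros instead of NaN.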
@@ -359,7 +369,7 @@ def fit(self, X, y=None):
             if self.with_std:
                 var = mean_variance_axis(X, axis=0)[1]
                 self.std_ = np.sqrt(var)
-                self.std_[var == 0.0] = 1.0
+                self.std_ = _handle_zeros_in_scale(self.std_)
             else:
                 self.std_ = None
             return self
@@ -430,6 +440,119 @@ def inverse_transform(self, X, copy=None):
         return X
 
 
+class MaxAbsScaler(BaseEstimator, TransformerMixin):
+    """Scale each feature by its maximum absolute value.
+
+    This estimator scales each feature individually such
+    that the maximal absolute value of each feature in the
+    training set will be 1.0. It does not shift/center the data, and
+    thus does not destroy any sparsity.
+
+    This scaler can also be applied to sparse CSR or CSC matrices.
+
+    Parameters
+    ----------
+    copy : boolean, optional, default is True
+        Set to False to perform inplace scaling and avoid a copy (if the input
+        is already a numpy array).
+
+    Attributes
+    ----------
+    scale_ : ndarray, shape (n_features,)
+        Per feature relative scaling of the data.
+    """
+
+    def __init__(self, copy=True):
+        self.copy = copy
+
+    def fit(self, X, y=None):
+        """Compute the per-feature maximum absolute value for later scaling.
+
+        Parameters
+        ----------
+        X : array-like, shape [n_samples, n_features]
+            The data used to compute the per-feature maximum absolute value
+            used for later scaling along the features axis.
+        """
+        X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
+                        ensure_2d=False, estimator=self, dtype=FLOAT_DTYPES)
+        if sparse.issparse(X):
+            mins, maxs = min_max_axis(X, axis=0)
+            scales = np.maximum(np.abs(mins), np.abs(maxs))
+        else:
+            scales = np.abs(X).max(axis=0)
+            scales = np.array(scales)
+            scales = scales.reshape(-1)
+        self.scale_ = _handle_zeros_in_scale(scales)
+        return self
+
+    def transform(self, X, y=None):
+        """Scale the data.
+
+        Parameters
+        ----------
+        X : array-like or sparse CSR/CSC matrix.
+            The data that should be scaled.
+        """
+        check_is_fitted(self, 'scale_')
+        X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
+                        ensure_2d=False, estimator=self, dtype=FLOAT_DTYPES)
+        if sparse.issparse(X):
+            if X.shape[0] == 1:
+                inplace_row_scale(X, 1.0 / self.scale_)
+            else:
+                inplace_column_scale(X, 1.0 / self.scale_)
+        else:
+            X /= self.scale_
+        return X
+
+    def inverse_transform(self, X):
+        """Scale back the data to the original representation.
+
+        Parameters
+        ----------
+        X : array-like or sparse CSR/CSC matrix.
+            The data that should be transformed back.
+        """
+        check_is_fitted(self, 'scale_')
+        X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
+                        ensure_2d=False, estimator=self, dtype=FLOAT_DTYPES)
+        if sparse.issparse(X):
+            if X.shape[0] == 1:
+                inplace_row_scale(X, self.scale_)
+            else:
+                inplace_column_scale(X, self.scale_)
+        else:
+            X *= self.scale_
+        return X
+
+
+def maxabs_scale(X, axis=0, copy=True):
+    """Scale each feature to the [-1, 1] range without breaking the sparsity.
+
+    This function scales each feature individually such
+    that the maximal absolute value of each feature in the
+    training set will be 1.0.
+
+    This scaler can also be applied to sparse CSR or CSC matrices.
+
+    Parameters
+    ----------
+    axis : int (0 by default)
+        axis used to scale along. If 0, independently scale each feature,
+        otherwise (if 1) scale each sample.
+
+    copy : boolean, optional, default is True
+        Set to False to perform inplace scaling and avoid a copy (if the input
+        is already a numpy array).
+    """
+    s = MaxAbsScaler(copy=copy)
+    if axis == 0:
+        return s.fit_transform(X)
+    else:
+        return s.fit_transform(X.T).T
+
+
 class RobustScaler(BaseEstimator, TransformerMixin):
     """Scale features using statistics that are robust to outliers.
 
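A short usage sketch of the new estimator and its convenience function (not part of the commit; assumes numpy)::

    import numpy as np
    from sklearn.preprocessing import MaxAbsScaler, maxabs_scale

    X = np.array([[ 1., -1.,  2.],
                  [ 2.,  0.,  0.],
                  [ 0.,  1., -1.]])

    scaler = MaxAbsScaler()
    X_scaled = scaler.fit_transform(X)        # per-feature max |x| becomes 1.0
    X_back = scaler.inverse_transform(X_scaled)
    print(np.allclose(X, X_back))             # True: scaling is invertible

    # The functional form; axis=1 scales each sample instead of each feature.
    print(maxabs_scale(X, axis=1))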
@@ -498,28 +621,15 @@ def __init__(self, with_centering=True, with_scaling=True, copy=True):
 
     def _check_array(self, X, copy):
         """Makes sure centering is not enabled for sparse matrices."""
-        X = check_array(X, accept_sparse=('csr', 'csc'), dtype=np.float,
-                        copy=copy, ensure_2d=False)
+        X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
+                        ensure_2d=False, estimator=self, dtype=FLOAT_DTYPES)
         if sparse.issparse(X):
            if self.with_centering:
                raise ValueError(
                    "Cannot center sparse matrices: use `with_centering=False`"
                    " instead. See docstring for motivation and alternatives.")
         return X
 
-    def _handle_zeros_in_scale(self, scale):
-        ''' Makes sure that whenever scale is zero, we handle it correctly.
-
-        This happens in most scalers when we have constant features.'''
-        # if we are fitting on 1D arrays, scale might be a scalar
-        if np.isscalar(scale):
-            if scale == 0:
-                scale = 1.
-        elif isinstance(scale, np.ndarray):
-            scale[scale == 0.0] = 1.0
-            scale[~np.isfinite(scale)] = 1.0
-        return scale
-
     def fit(self, X, y=None):
         """Compute the median and quantiles to be used for scaling.
 
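The guard above is what produces the error mentioned in the updated documentation; a sketch of both paths, assuming scipy::

    import scipy.sparse as sp
    from sklearn.preprocessing import RobustScaler

    X_sparse = sp.csr_matrix([[1., 0.], [0., 2.], [3., 0.]])

    # The default with_centering=True refuses sparse input outright.
    try:
        RobustScaler().fit(X_sparse)
    except ValueError as exc:
        print(exc)

    # As the docs note: fit on dense data, then transform sparse input.
    scaler = RobustScaler(with_centering=False).fit(X_sparse.toarray())
    print(scaler.transform(X_sparse).toarray())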
@@ -539,12 +649,7 @@ def fit(self, X, y=None):
         if self.with_scaling:
             q = np.percentile(X, (25, 75), axis=0)
             self.scale_ = (q[1] - q[0])
-            if np.isscalar(self.scale_):
-                if self.scale_ == 0:
-                    self.scale_ = 1.
-            else:
-                self.scale_[self.scale_ == 0.0] = 1.0
-                self.scale_[~np.isfinite(self.scale_)] = 1.0
+            self.scale_ = _handle_zeros_in_scale(self.scale_)
         return self
 
     def transform(self, X, y=None):
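The scale being guarded here is the interquartile range, so a constant feature now falls back to a scale of 1.0; a numeric sketch::

    import numpy as np
    from sklearn.preprocessing import RobustScaler

    # First feature has an IQR of 2.0; the second is constant (IQR 0).
    X = np.array([[1., 7.],
                  [2., 7.],
                  [3., 7.],
                  [4., 7.],
                  [5., 7.]])
    print(RobustScaler().fit(X).scale_)  # [ 2.  1.]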
@@ -847,7 +952,7 @@ def normalize(X, norm='l2', axis=1, copy=True):
             norms = row_norms(X)
         elif norm == 'max':
             norms = np.max(X, axis=1)
-        norms[norms == 0.0] = 1.0
+        norms = _handle_zeros_in_scale(norms)
        X /= norms[:, np.newaxis]
 
     if axis == 0:
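The same guard applied to row norms means an all-zero row passes through :func:`normalize` unchanged; a sketch::

    import numpy as np
    from sklearn.preprocessing import normalize

    X = np.array([[3., 4.],
                  [0., 0.],   # zero row: its 'max' norm is replaced by 1.0
                  [1., 1.]])
    print(normalize(X, norm='max'))
    # [[ 0.75  1.  ]
    #  [ 0.    0.  ]
    #  [ 1.    1.  ]]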
