Commit fadca42 (parent: bdef419)

ENH Add MaxAbsScaler

4 files changed: +250, -51 lines

doc/modules/preprocessing.rst (60 additions, 19 deletions)

@@ -102,8 +102,10 @@ Scaling features to a range
 ---------------------------
 
 An alternative standardization is scaling features to
-lie between a given minimum and maximum value, often between zero and one.
-This can be achieved using :class:`MinMaxScaler`.
+lie between a given minimum and maximum value, often between zero and one,
+or so that the maximum value of each feature is scaled to unit size.
+This can be achieved using :class:`MinMaxScaler` or :class:`MaxAbsScaler`,
+respectively.
 
 The motivations to use this scaling include robustness to very small
 standard deviations of features and preserving zero entries in sparse data.
@@ -146,6 +148,62 @@ full formula is::
 
     X_scaled = X_std / (max - min) + min
 
+:class:`MaxAbsScaler` works in a very similar fashion, but scales the data so
+that it lies within the range ``[-1, 1]``. It is meant for data that is
+already centered at zero, and in particular is very well suited for
+sparse data.
+
+Here is how to use the toy data from the previous example with this scaler::
+
+    >>> X_train = np.array([[ 1., -1.,  2.],
+    ...                     [ 2.,  0.,  0.],
+    ...                     [ 0.,  1., -1.]])
+    ...
+    >>> max_abs_scaler = preprocessing.MaxAbsScaler()
+    >>> X_train_maxabs = max_abs_scaler.fit_transform(X_train)
+    >>> X_train_maxabs                # doctest: +NORMALIZE_WHITESPACE
+    array([[ 0.5, -1. ,  1. ],
+           [ 1. ,  0. ,  0. ],
+           [ 0. ,  1. , -0.5]])
+    >>> X_test = np.array([[ -3., -1.,  4.]])
+    >>> X_test_maxabs = max_abs_scaler.transform(X_test)
+    >>> X_test_maxabs                 # doctest: +NORMALIZE_WHITESPACE
+    array([[-1.5, -1. ,  2. ]])
+    >>> max_abs_scaler.scale_         # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
+    array([ 2.,  1.,  2.])
+
+
+As with :func:`scale`, the ``preprocessing`` module further provides a
+convenience function :func:`maxabs_scale` if you don't want to use
+the ``Transformer`` API.
+
+
+Scaling sparse data
+-------------------
+Centering sparse data would destroy the sparseness structure in the data, and
+thus rarely is a sensible thing to do. However, it can make sense to scale
+sparse inputs, especially if features are on different scales.
+
+:class:`MaxAbsScaler` and :func:`maxabs_scale` were specifically designed
+for scaling sparse data, and are the recommended way to go about this.
+However, :func:`scale` and :class:`StandardScaler` can accept ``scipy.sparse``
+matrices as input, as long as ``with_mean=False`` is explicitly passed
+to the constructor. Otherwise a ``ValueError`` will be raised, as
+silently centering would break the sparsity and would often crash the
+execution by unintentionally allocating excessive amounts of memory.
+:class:`RobustScaler` cannot be fitted on sparse inputs, but its
+``transform`` method can be used on them.
+
+Note that the scalers accept both Compressed Sparse Rows and Compressed
+Sparse Columns format (see ``scipy.sparse.csr_matrix`` and
+``scipy.sparse.csc_matrix``). Any other sparse input will be **converted to
+the Compressed Sparse Rows representation**. To avoid unnecessary memory
+copies, it is recommended to choose the CSR or CSC representation upstream.
+
+Finally, if the centered data is expected to be small enough, explicitly
+converting the input to an array using the ``toarray`` method of sparse
+matrices is another option.
+
 
 Scaling data with outliers
 --------------------------
@@ -173,23 +231,6 @@ data.
   or :class:`sklearn.decomposition.RandomizedPCA` with ``whiten=True``
   to further remove the linear correlation across features.
 
-.. topic:: Sparse input
-
-  :func:`scale` and :class:`StandardScaler` accept ``scipy.sparse`` matrices
-  as input **only when with_mean=False is explicitly passed to the
-  constructor**. Otherwise a ``ValueError`` will be raised as
-  silently centering would break the sparsity and would often crash the
-  execution by allocating excessive amounts of memory unintentionally.
-
-  If the centered data is expected to be small enough, explicitly convert
-  the input to an array using the ``toarray`` method of sparse matrices
-  instead.
-
-  For sparse input the data is **converted to the Compressed Sparse Rows
-  representation** (see ``scipy.sparse.csr_matrix``).
-  To avoid unnecessary memory copies, it is recommended to choose the CSR
-  representation upstream.
-
 .. topic:: Scaling target variables in regression
 
     :func:`scale` and :class:`StandardScaler` work out-of-the-box with 1d arrays.
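As a quick illustration of the new "Scaling sparse data" section above, here is
a minimal sketch (not part of the commit) showing that MaxAbsScaler preserves
the sparsity structure of a CSR input; it assumes a scikit-learn build that
includes this change, plus numpy and scipy:

    import numpy as np
    import scipy.sparse as sp
    from sklearn.preprocessing import MaxAbsScaler

    # Same toy data as in the docs, stored as Compressed Sparse Rows.
    X = sp.csr_matrix([[1., -1., 2.],
                       [2., 0., 0.],
                       [0., 1., -1.]])

    scaler = MaxAbsScaler()
    X_scaled = scaler.fit_transform(X)

    print(scaler.scale_)          # [ 2.  1.  2.] -- per-column max of |x|
    print(X_scaled.nnz == X.nnz)  # True: zero entries stay zero
    print(X_scaled.toarray())     # values divided column-wise by scale_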

sklearn/preprocessing/__init__.py (3 additions, 0 deletions)

@@ -6,6 +6,7 @@
 from .data import Binarizer
 from .data import KernelCenterer
 from .data import MinMaxScaler
+from .data import MaxAbsScaler
 from .data import Normalizer
 from .data import RobustScaler
 from .data import StandardScaler
@@ -33,6 +34,7 @@
     'LabelEncoder',
     'MultiLabelBinarizer',
     'MinMaxScaler',
+    'MaxAbsScaler',
     'Normalizer',
     'OneHotEncoder',
     'RobustScaler',
@@ -43,5 +45,6 @@
     'normalize',
     'scale',
     'robust_scale',
+    'maxabs_scale',
     'label_binarize',
 ]
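A one-line smoke test (illustrative only, assuming a build that contains this
commit) confirms that both new names are importable from the package
namespace:

    from sklearn.preprocessing import MaxAbsScaler, maxabs_scale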

sklearn/preprocessing/data.py (138 additions, 32 deletions)

@@ -32,6 +32,7 @@
     'Binarizer',
     'KernelCenterer',
     'MinMaxScaler',
+    'MaxAbsScaler',
     'Normalizer',
     'OneHotEncoder',
     'RobustScaler',
@@ -41,6 +42,7 @@
     'normalize',
     'scale',
     'robust_scale',
+    'maxabs_scale',
 ]
 
 
@@ -59,16 +61,28 @@ def _mean_and_std(X, axis=0, with_mean=True, with_std=True):
 
     if with_std:
         std_ = Xr.std(axis=0)
-        if isinstance(std_, np.ndarray):
-            std_[std_ == 0.] = 1.0
-        elif std_ == 0.:
-            std_ = 1.
+        std_ = _handle_zeros_in_scale(std_)
     else:
         std_ = None
 
     return mean_, std_
 
 
+def _handle_zeros_in_scale(scale):
+    """Make sure that whenever scale is zero, we handle it correctly.
+
+    This happens in most scalers when we have constant features.
+    """
+    # if we are fitting on 1D arrays, scale might be a scalar
+    if np.isscalar(scale):
+        if scale == 0:
+            scale = 1.
+    elif isinstance(scale, np.ndarray):
+        scale[scale == 0.0] = 1.0
+        scale[~np.isfinite(scale)] = 1.0
+    return scale
+
+
 def scale(X, axis=0, with_mean=True, with_std=True, copy=True):
     """Standardize a dataset along any axis
 
@@ -132,7 +146,7 @@ def scale(X, axis=0, with_mean=True, with_std=True, copy=True):
         if copy:
             X = X.copy()
         _, var = mean_variance_axis(X, axis=0)
-        var[var == 0.0] = 1.0
+        var = _handle_zeros_in_scale(var)
         inplace_column_scale(X, 1 / np.sqrt(var))
     else:
         X = np.asarray(X)
@@ -233,11 +247,7 @@ def fit(self, X, y=None):
                              " than maximum. Got %s." % str(feature_range))
         data_min = np.min(X, axis=0)
         data_range = np.max(X, axis=0) - data_min
-        # Do not scale constant features
-        if isinstance(data_range, np.ndarray):
-            data_range[data_range == 0.0] = 1.0
-        elif data_range == 0.:
-            data_range = 1.
+        data_range = _handle_zeros_in_scale(data_range)
         self.scale_ = (feature_range[1] - feature_range[0]) / data_range
         self.min_ = feature_range[0] - data_min * self.scale_
         self.data_range = data_range
@@ -359,7 +369,7 @@ def fit(self, X, y=None):
             if self.with_std:
                 var = mean_variance_axis(X, axis=0)[1]
                 self.std_ = np.sqrt(var)
-                self.std_[var == 0.0] = 1.0
+                self.std_ = _handle_zeros_in_scale(self.std_)
             else:
                 self.std_ = None
             return self
@@ -430,6 +440,119 @@ def inverse_transform(self, X, copy=None):
         return X
 
 
+class MaxAbsScaler(BaseEstimator, TransformerMixin):
+    """Scale each feature to the [-1, 1] range without breaking the sparsity.
+
+    This estimator scales each feature individually such that the
+    maximal absolute value of each feature in the training set will
+    be 1.0.
+
+    This scaler can also be applied to sparse CSR or CSC matrices.
+
+    Parameters
+    ----------
+    copy : boolean, optional, default is True
+        Set to False to perform inplace scaling and avoid a copy (if the input
+        is already a numpy array).
+
+    Attributes
+    ----------
+    `scale_` : ndarray, shape (n_features,)
+        Per feature relative scaling of the data.
+    """
+
+    def __init__(self, copy=True):
+        self.copy = copy
+
+    def fit(self, X, y=None):
+        """Compute the per-feature maximum absolute values used for
+        later scaling.
+
+        Parameters
+        ----------
+        X : array-like, shape [n_samples, n_features]
+            The data used to compute the per-feature maximum absolute values
+            used for later scaling along the features axis.
+        """
+        X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
+                        ensure_2d=False, warn_on_dtype=True,
+                        estimator=self, dtype=FLOAT_DTYPES)
+        if sparse.issparse(X):
+            mins, maxs = min_max_axis(X, axis=0)
+            scales = np.maximum(np.abs(mins), np.abs(maxs))
+        else:
+            scales = np.abs(X).max(axis=0)
+        scales = np.array(scales)
+        scales = scales.reshape(-1)
+        self.scale_ = _handle_zeros_in_scale(scales)
+        return self
+
+    def transform(self, X, y=None):
+        """Scale the data.
+
+        Parameters
+        ----------
+        X : array-like or CSR/CSC matrix.
+            The data that should be scaled.
+        """
+        check_is_fitted(self, 'scale_')
+        X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
+                        ensure_2d=False, warn_on_dtype=True,
+                        estimator=self, dtype=FLOAT_DTYPES)
+        if sparse.issparse(X):
+            if X.shape[0] == 1:
+                inplace_row_scale(X, 1.0 / self.scale_)
+            else:
+                inplace_column_scale(X, 1.0 / self.scale_)
+        else:
+            X /= self.scale_
+        return X
+
+    def inverse_transform(self, X):
+        """Scale back the data to the original representation.
+
+        Parameters
+        ----------
+        X : array-like or CSR/CSC matrix.
+            The data that should be transformed back.
+        """
+        check_is_fitted(self, 'scale_')
+        X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
+                        ensure_2d=False, warn_on_dtype=True,
+                        estimator=self, dtype=FLOAT_DTYPES)
+        if sparse.issparse(X):
+            if X.shape[0] == 1:
+                inplace_row_scale(X, self.scale_)
+            else:
+                inplace_column_scale(X, self.scale_)
+        else:
+            X *= self.scale_
+        return X
+
+
+def maxabs_scale(X, copy=True):
+    """Scale each feature to the [-1, 1] range without breaking the sparsity.
+
+    This function scales each feature individually such that the maximal
+    absolute value of each feature in the data will be 1.0.
+
+    It can also be applied to sparse CSR or CSC matrices.
+
+    Parameters
+    ----------
+    X : array-like or sparse matrix, shape [n_samples, n_features]
+        The data that should be scaled.
+
+    copy : boolean, optional, default is True
+        Set to False to perform inplace scaling and avoid a copy (if the input
+        is already a numpy array).
+    """
+    s = MaxAbsScaler(copy=copy)
+    return s.fit_transform(X)
+
+
 class RobustScaler(BaseEstimator, TransformerMixin):
     """Scale features using statistics that are robust to outliers.
 
@@ -498,28 +621,16 @@ def __init__(self, with_centering=True, with_scaling=True, copy=True):
 
     def _check_array(self, X, copy):
         """Makes sure centering is not enabled for sparse matrices."""
-        X = check_array(X, accept_sparse=('csr', 'csc'), dtype=np.float,
-                        copy=copy, ensure_2d=False)
+        X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
+                        ensure_2d=False, warn_on_dtype=True,
+                        estimator=self, dtype=FLOAT_DTYPES)
         if sparse.issparse(X):
             if self.with_centering:
                 raise ValueError(
                     "Cannot center sparse matrices: use `with_centering=False`"
                     " instead. See docstring for motivation and alternatives.")
         return X
 
-    def _handle_zeros_in_scale(self, scale):
-        """Make sure that whenever scale is zero, we handle it correctly.
-
-        This happens in most scalers when we have constant features.
-        """
-        # if we are fitting on 1D arrays, scale might be a scalar
-        if np.isscalar(scale):
-            if scale == 0:
-                scale = 1.
-        elif isinstance(scale, np.ndarray):
-            scale[scale == 0.0] = 1.0
-            scale[~np.isfinite(scale)] = 1.0
-        return scale
-
     def fit(self, X, y=None):
         """Compute the median and quantiles to be used for scaling.
 
@@ -539,12 +650,7 @@ def fit(self, X, y=None):
         if self.with_scaling:
             q = np.percentile(X, (25, 75), axis=0)
             self.scale_ = (q[1] - q[0])
-            if np.isscalar(self.scale_):
-                if self.scale_ == 0:
-                    self.scale_ = 1.
-            else:
-                self.scale_[self.scale_ == 0.0] = 1.0
-                self.scale_[~np.isfinite(self.scale_)] = 1.0
+            self.scale_ = _handle_zeros_in_scale(self.scale_)
         return self
 
     def transform(self, X, y=None):
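To see why the refactored _handle_zeros_in_scale helper matters, consider a
constant feature: its maximum absolute value is 0, and without the guard every
transform would divide by zero. A small hedged sketch, not part of the commit,
assuming a scikit-learn build that includes it:

    import numpy as np
    from sklearn.preprocessing import MaxAbsScaler, maxabs_scale

    # The first column is constant (all zeros), so its raw scale would be 0.
    X = np.array([[0., -1., 2.],
                  [0., 0., 0.],
                  [0., 1., -4.]])

    scaler = MaxAbsScaler().fit(X)
    print(scaler.scale_)     # [ 1.  1.  4.] -- zero scale replaced by 1.0

    X_scaled = scaler.transform(X)
    print(X_scaled[:, 0])    # constant column passes through unchanged

    # Round trip: inverse_transform multiplies scale_ back in.
    print(np.allclose(scaler.inverse_transform(X_scaled), X))   # True

    # The convenience function is equivalent to fit + transform in one call.
    print(np.allclose(maxabs_scale(X), X_scaled))               # True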
