Commit 9f28d10

ENH add MaxAbsScaler
1 parent bdef419 commit 9f28d10

4 files changed

+264
-52
lines changed

doc/modules/preprocessing.rst

Lines changed: 59 additions & 19 deletions
@@ -102,8 +102,10 @@ Scaling features to a range
102102
---------------------------
103103

104104
An alternative standardization is scaling features to
105-
lie between a given minimum and maximum value, often between zero and one.
106-
This can be achieved using :class:`MinMaxScaler`.
105+
lie between a given minimum and maximum value, often between zero and one,
106+
or so that the maximum absolute value of each feature is scaled to unit size.
107+
This can be achieved using :class:`MinMaxScaler` or :class:`MaxAbsScaler`,
108+
respectively.
107109

108110
The motivation to use this scaling includes robustness to very small
109111
standard deviations of features and preserving zero entries in sparse data.
@@ -146,6 +148,61 @@ full formula is::
146148

147149
X_scaled = X_std / (max - min) + min
148150

151+
:class:`MaxAbsScaler` works in a very similar fashion, but scales data so
152+
it lies within the range ``[-1, 1]``, and is meant for data
153+
that is already centered at zero or sparse data.
154+
155+
Here is how to use the toy data from the previous example with this scaler::
156+
157+
>>> X_train = np.array([[ 1., -1., 2.],
158+
... [ 2., 0., 0.],
159+
... [ 0., 1., -1.]])
160+
...
161+
>>> max_abs_scaler = preprocessing.MaxAbsScaler()
162+
>>> X_train_maxabs = max_abs_scaler.fit_transform(X_train)
163+
>>> X_train_maxabs # doctest: +NORMALIZE_WHITESPACE
164+
array([[ 0.5, -1. , 1. ],
165+
[ 1. , 0. , 0. ],
166+
[ 0. , 1. , -0.5]])
167+
>>> X_test = np.array([[ -3., -1., 4.]])
168+
>>> X_test_maxabs = max_abs_scaler.transform(X_test)
169+
>>> X_test_maxabs # doctest: +NORMALIZE_WHITESPACE
170+
array([[-1.5, -1. , 2. ]])
171+
>>> max_abs_scaler.scale_ # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
172+
array([ 2., 1., 2.])
173+
174+
175+
As with :func:`scale`, the module further provides a
176+
convenience function :func:`maxabs_scale` if you don't want to
177+
create an object.
178+
179+
180+
Scaling sparse data
181+
-------------------
182+
Centering sparse data would destroy the sparseness structure in the data, and
183+
thus rarely is a sensible thing to do. However, it can make sense to scale
184+
sparse inputs, especially if features are on different scales.
185+
186+
:class:`MaxAbsScaler` and :func:`maxabs_scale` were specifically designed
187+
for scaling sparse data, and are the recommended way to go about this.
188+
However, :func:`scale` and :class:`StandardScaler` can accept ``scipy.sparse``
189+
matrices as input, as long as ``with_mean=False`` is explicitly passed
190+
to the constructor. Otherwise a ``ValueError`` will be raised as
191+
silently centering would break the sparsity and would often crash the
192+
execution by allocating excessive amounts of memory unintentionally.
193+
:class:`RobustScaler` cannot be fitted to sparse inputs, but you can use
194+
the ``transform`` method on sparse inputs.
195+
196+
Note that the scalers accept both Compressed Sparse Rows and Compressed
197+
Sparse Columns format (see ``scipy.sparse.csr_matrix`` and
198+
``scipy.sparse.csc_matrix``). Any other sparse input will be **converted to
199+
the Compressed Sparse Rows representation**. To avoid unnecessary memory
200+
copies, it is recommended to choose the CSR or CSC representation upstream.
201+
202+
Finally, if the centered data is expected to be small enough, explicitly
203+
converting the input to an array using the ``toarray`` method of sparse matrices
204+
is another option.
205+
149206

150207
Scaling data with outliers
151208
--------------------------
@@ -173,23 +230,6 @@ data.
173230
or :class:`sklearn.decomposition.RandomizedPCA` with ``whiten=True``
174231
to further remove the linear correlation across features.
175232

176-
.. topic:: Sparse input
177-
178-
:func:`scale` and :class:`StandardScaler` accept ``scipy.sparse`` matrices
179-
as input **only when with_mean=False is explicitly passed to the
180-
constructor**. Otherwise a ``ValueError`` will be raised as
181-
silently centering would break the sparsity and would often crash the
182-
execution by allocating excessive amounts of memory unintentionally.
183-
184-
If the centered data is expected to be small enough, explicitly convert
185-
the input to an array using the ``toarray`` method of sparse matrices
186-
instead.
187-
188-
For sparse input the data is **converted to the Compressed Sparse Rows
189-
representation** (see ``scipy.sparse.csr_matrix``).
190-
To avoid unnecessary memory copies, it is recommended to choose the CSR
191-
representation upstream.
192-
193233
.. topic:: Scaling target variables in regression
194234

195235
:func:`scale` and :class:`StandardScaler` work out-of-the-box with 1d arrays.
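
The max-abs scaling described in the documentation hunk above can be reproduced with plain NumPy. The following is a minimal sketch, not the library implementation: it assumes dense input and reuses the toy arrays from the doctest, and the variable name `scale` is only illustrative.

```python
import numpy as np

# Toy data from the documentation example in the diff above.
X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_test = np.array([[-3., -1., 4.]])

# Per-feature maximum absolute value, learned from the training data only.
scale = np.abs(X_train).max(axis=0)

# Dividing by the scale maps the training data into [-1, 1].
# No shift is applied, so zero entries stay exactly zero.
X_train_maxabs = X_train / scale

# Test data reuses the training scale, so it can fall outside [-1, 1].
X_test_maxabs = X_test / scale

print(scale)
print(X_train_maxabs)
print(X_test_maxabs)
```

Because the transformation is a pure division, the zero pattern of the input is unchanged, which is the property that makes this scaling suitable for sparse data.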

sklearn/preprocessing/__init__.py

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@
66
from .data import Binarizer
77
from .data import KernelCenterer
88
from .data import MinMaxScaler
9+
from .data import MaxAbsScaler
910
from .data import Normalizer
1011
from .data import RobustScaler
1112
from .data import StandardScaler
@@ -14,6 +15,7 @@
1415
from .data import normalize
1516
from .data import scale
1617
from .data import robust_scale
18+
from .data import maxabs_scale
1719
from .data import OneHotEncoder
1820

1921
from .data import PolynomialFeatures
@@ -33,6 +35,7 @@
3335
'LabelEncoder',
3436
'MultiLabelBinarizer',
3537
'MinMaxScaler',
38+
'MaxAbsScaler',
3639
'Normalizer',
3740
'OneHotEncoder',
3841
'RobustScaler',
@@ -43,5 +46,6 @@
4346
'normalize',
4447
'scale',
4548
'robust_scale',
49+
'maxabs_scale',
4650
'label_binarize',
4751
]

sklearn/preprocessing/data.py

Lines changed: 143 additions & 33 deletions
Original file line numberDiff line numberDiff line change
@@ -32,6 +32,7 @@
3232
'Binarizer',
3333
'KernelCenterer',
3434
'MinMaxScaler',
35+
'MaxAbsScaler',
3536
'Normalizer',
3637
'OneHotEncoder',
3738
'RobustScaler',
@@ -41,6 +42,7 @@
4142
'normalize',
4243
'scale',
4344
'robust_scale',
45+
'maxabs_scale',
4446
]
4547

4648

@@ -59,16 +61,28 @@ def _mean_and_std(X, axis=0, with_mean=True, with_std=True):
5961

6062
if with_std:
6163
std_ = Xr.std(axis=0)
62-
if isinstance(std_, np.ndarray):
63-
std_[std_ == 0.] = 1.0
64-
elif std_ == 0.:
65-
std_ = 1.
64+
std_ = _handle_zeros_in_scale(std_)
6665
else:
6766
std_ = None
6867

6968
return mean_, std_
7069

7170

71+
def _handle_zeros_in_scale(scale):
72+
''' Makes sure that whenever scale is zero, we handle it correctly.
73+
74+
This happens in most scalers when we have constant features.'''
75+
76+
# if we are fitting on 1D arrays, scale might be a scalar
77+
if np.isscalar(scale):
78+
if scale == 0:
79+
scale = 1.
80+
elif isinstance(scale, np.ndarray):
81+
scale[scale == 0.0] = 1.0
82+
scale[~np.isfinite(scale)] = 1.0
83+
return scale
84+
85+
7286
def scale(X, axis=0, with_mean=True, with_std=True, copy=True):
7387
"""Standardize a dataset along any axis
7488
@@ -132,7 +146,7 @@ def scale(X, axis=0, with_mean=True, with_std=True, copy=True):
132146
if copy:
133147
X = X.copy()
134148
_, var = mean_variance_axis(X, axis=0)
135-
var[var == 0.0] = 1.0
149+
var = _handle_zeros_in_scale(var)
136150
inplace_column_scale(X, 1 / np.sqrt(var))
137151
else:
138152
X = np.asarray(X)
@@ -233,11 +247,7 @@ def fit(self, X, y=None):
233247
" than maximum. Got %s." % str(feature_range))
234248
data_min = np.min(X, axis=0)
235249
data_range = np.max(X, axis=0) - data_min
236-
# Do not scale constant features
237-
if isinstance(data_range, np.ndarray):
238-
data_range[data_range == 0.0] = 1.0
239-
elif data_range == 0.:
240-
data_range = 1.
250+
data_range = _handle_zeros_in_scale(data_range)
241251
self.scale_ = (feature_range[1] - feature_range[0]) / data_range
242252
self.min_ = feature_range[0] - data_min * self.scale_
243253
self.data_range = data_range
@@ -359,7 +369,7 @@ def fit(self, X, y=None):
359369
if self.with_std:
360370
var = mean_variance_axis(X, axis=0)[1]
361371
self.std_ = np.sqrt(var)
362-
self.std_[var == 0.0] = 1.0
372+
self.std_ = _handle_zeros_in_scale(self.std_)
363373
else:
364374
self.std_ = None
365375
return self
@@ -430,6 +440,124 @@ def inverse_transform(self, X, copy=None):
430440
return X
431441

432442

443+
class MaxAbsScaler(BaseEstimator, TransformerMixin):
444+
"""Scale each feature by its maximum absolute value.
445+
446+
This estimator scales each feature individually such
447+
that the maximal absolute value of each feature in the
448+
training set will be 1.0. It does not shift/center the data, and
449+
thus does not destroy any sparsity.
450+
451+
This scaler can also be applied to sparse CSR or CSC matrices.
452+
453+
Parameters
454+
----------
455+
copy : boolean, optional, default is True
456+
Set to False to perform inplace scaling and avoid a copy (if the input
457+
is already a numpy array).
458+
459+
Attributes
460+
----------
461+
`scale_` : ndarray, shape (n_features,)
462+
Per feature relative scaling of the data.
463+
"""
464+
465+
def __init__(self, copy=True):
466+
self.copy = copy
467+
468+
def fit(self, X, y=None):
469+
"""Compute the maximum absolute value to be used for later scaling.
470+
471+
Parameters
472+
----------
473+
X : array-like, shape [n_samples, n_features]
474+
The data used to compute the per-feature maximum absolute value
475+
used for later scaling along the features axis.
476+
"""
477+
X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
478+
ensure_2d=False, estimator=self, dtype=FLOAT_DTYPES)
479+
if sparse.issparse(X):
480+
mins, maxs = min_max_axis(X, axis=0)
481+
scales = np.maximum(np.abs(mins), np.abs(maxs))
482+
else:
483+
scales = np.abs(X).max(axis=0)
484+
scales = np.array(scales)
485+
scales = scales.reshape(-1)
486+
self.scale_ = _handle_zeros_in_scale(scales)
487+
return self
488+
489+
def transform(self, X, y=None):
490+
"""Scale the data
491+
492+
Parameters
493+
----------
494+
X : array-like or sparse CSR/CSC matrix.
495+
The data that should be scaled.
496+
"""
497+
check_is_fitted(self, 'scale_')
498+
X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
499+
ensure_2d=False, estimator=self, dtype=FLOAT_DTYPES)
500+
if sparse.issparse(X):
501+
if X.shape[0] == 1:
502+
inplace_row_scale(X, 1.0 / self.scale_)
503+
else:
504+
inplace_column_scale(X, 1.0 / self.scale_)
505+
else:
506+
X /= self.scale_
507+
return X
508+
509+
def inverse_transform(self, X):
510+
"""Scale back the data to the original representation
511+
512+
Parameters
513+
----------
514+
X : array-like or sparse CSR/CSC matrix.
515+
The data that should be transformed back.
516+
"""
517+
check_is_fitted(self, 'scale_')
518+
X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
519+
ensure_2d=False, estimator=self, dtype=FLOAT_DTYPES)
520+
if sparse.issparse(X):
521+
if X.shape[0] == 1:
522+
inplace_row_scale(X, self.scale_)
523+
else:
524+
inplace_column_scale(X, self.scale_)
525+
else:
526+
X *= self.scale_
527+
return X
528+
529+
530+
def maxabs_scale(X, axis=0, copy=True):
531+
"""Scale each feature to the [-1, 1] range without breaking the sparsity.
532+
533+
This function scales each feature individually such
534+
that the maximal absolute value of each feature in the
535+
training set will be 1.0.
536+
537+
This scaler can also be applied to sparse CSR or CSC matrices.
538+
539+
Parameters
540+
----------
541+
axis : int (0 by default)
542+
axis used to scale along. If 0, independently scale each feature,
543+
otherwise (if 1) scale each sample.
544+
545+
copy : boolean, optional, default is True
546+
Set to False to perform inplace scaling and avoid a copy (if the input
547+
is already a numpy array).
548+
549+
Attributes
550+
----------
551+
`scale_` : ndarray, shape (n_features,)
552+
Per feature relative scaling of the data.
553+
"""
554+
s = MaxAbsScaler(copy=copy)
555+
if axis == 0:
556+
return s.fit_transform(X)
557+
else:
558+
return s.fit_transform(X.T).T
559+
560+
433561
class RobustScaler(BaseEstimator, TransformerMixin):
434562
"""Scale features using statistics that are robust to outliers.
435563
@@ -498,28 +626,15 @@ def __init__(self, with_centering=True, with_scaling=True, copy=True):
498626

499627
def _check_array(self, X, copy):
500628
"""Makes sure centering is not enabled for sparse matrices."""
501-
X = check_array(X, accept_sparse=('csr', 'csc'), dtype=np.float,
502-
copy=copy, ensure_2d=False)
629+
X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
630+
ensure_2d=False, estimator=self, dtype=FLOAT_DTYPES)
503631
if sparse.issparse(X):
504632
if self.with_centering:
505633
raise ValueError(
506634
"Cannot center sparse matrices: use `with_centering=False`"
507635
" instead. See docstring for motivation and alternatives.")
508636
return X
509637

510-
def _handle_zeros_in_scale(self, scale):
511-
''' Makes sure that whenever scale is zero, we handle it correctly.
512-
513-
This happens in most scalers when we have constant features.'''
514-
# if we are fitting on 1D arrays, scale might be a scalar
515-
if np.isscalar(scale):
516-
if scale == 0:
517-
scale = 1.
518-
elif isinstance(scale, np.ndarray):
519-
scale[scale == 0.0] = 1.0
520-
scale[~np.isfinite(scale)] = 1.0
521-
return scale
522-
523638
def fit(self, X, y=None):
524639
"""Compute the median and quantiles to be used for scaling.
525640
@@ -539,12 +654,7 @@ def fit(self, X, y=None):
539654
if self.with_scaling:
540655
q = np.percentile(X, (25, 75), axis=0)
541656
self.scale_ = (q[1] - q[0])
542-
if np.isscalar(self.scale_):
543-
if self.scale_ == 0:
544-
self.scale_ = 1.
545-
else:
546-
self.scale_[self.scale_ == 0.0] = 1.0
547-
self.scale_[~np.isfinite(self.scale_)] = 1.0
657+
self.scale_ = _handle_zeros_in_scale(self.scale_)
548658
return self
549659

550660
def transform(self, X, y=None):
@@ -847,7 +957,7 @@ def normalize(X, norm='l2', axis=1, copy=True):
847957
norms = row_norms(X)
848958
elif norm == 'max':
849959
norms = np.max(X, axis=1)
850-
norms[norms == 0.0] = 1.0
960+
norms = _handle_zeros_in_scale(norms)
851961
X /= norms[:, np.newaxis]
852962

853963
if axis == 0:
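
The reason the commit routes every scaler through one zero-handling helper can be illustrated outside the library. The sketch below is a hypothetical standalone copy of the helper shown in the diff (renamed `handle_zeros_in_scale`, without the leading underscore), applied to made-up data with one constant feature.

```python
import numpy as np


def handle_zeros_in_scale(scale):
    """Replace zero or non-finite scale values with 1.0.

    Standalone copy of the helper from the diff, for illustration only.
    """
    # When fitting on 1D input, the scale may be a plain scalar.
    if np.isscalar(scale):
        if scale == 0:
            scale = 1.
    elif isinstance(scale, np.ndarray):
        # Note: the array is modified in place and also returned.
        scale[scale == 0.0] = 1.0
        scale[~np.isfinite(scale)] = 1.0
    return scale


# The middle feature is constant (all zeros), so its max-abs scale is 0.
X = np.array([[1., 0., -4.],
              [2., 0., 2.]])

raw_scale = np.abs(X).max(axis=0)
scale = handle_zeros_in_scale(raw_scale.copy())

# Dividing by the patched scale leaves the constant feature untouched
# instead of producing NaN from 0 / 0.
X_scaled = X / scale

print(raw_scale)
print(scale)
print(X_scaled)
```

Since the array branch mutates its argument, the sketch passes a copy so that `raw_scale` remains available for comparison.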
