
[MRG] Add Yeo-Johnson transform to PowerTransformer #11520


Merged: 44 commits, Jul 20, 2018

Commits
06891eb
WIP - First draft on Yeo-Johnson transform
NicolasHug Jul 14, 2018
a88d168
Fixed lambda param optimization
NicolasHug Jul 14, 2018
ee09d7f
Some first tests
NicolasHug Jul 15, 2018
aea0842
Put helper method for yeo-johnson at the end
NicolasHug Jul 15, 2018
fba12eb
Added inverse transform + some tests
NicolasHug Jul 15, 2018
ed5a411
Added test for the optimization procedures
NicolasHug Jul 15, 2018
8bab32e
Created _box_cox_optimize method for better code symmetry
NicolasHug Jul 15, 2018
0525bab
Opt for yeo-johnson not influenced by Nan
NicolasHug Jul 15, 2018
8e187c4
Added doc
NicolasHug Jul 15, 2018
4173df3
Better test for nan in transform()
NicolasHug Jul 15, 2018
61e2183
Updated more docs and example
NicolasHug Jul 15, 2018
b1ac8d4
updated test
NicolasHug Jul 15, 2018
489bc70
Modified tests according to reviews
NicolasHug Jul 15, 2018
6783e3a
Changed default method from cox-box to yeo-johnson
NicolasHug Jul 15, 2018
dfd1ecc
Addressed most comments from @glemaitre, fixed flake8
NicolasHug Jul 16, 2018
78169f6
Removed box-cox specific checks in estimator_checks
NicolasHug Jul 16, 2018
f48a17b
More explicit variable names for mean and variance
NicolasHug Jul 16, 2018
67eaa98
Addressed comments from glemaitre
NicolasHug Jul 16, 2018
7a2bce7
Changed number of bins in plots to auto
NicolasHug Jul 16, 2018
2b56a9d
Fixed Nan issues (ignored warnings)
NicolasHug Jul 16, 2018
e928d26
Fixed docstring example issue
NicolasHug Jul 16, 2018
948ed2a
Merge branch 'master' into yeojohnson
NicolasHug Jul 16, 2018
5273212
Updated whatsnew
NicolasHug Jul 16, 2018
5efccdf
Addressed comments from glemaitre
NicolasHug Jul 16, 2018
0a10984
Merge branch 'master' into yeojohnson
NicolasHug Jul 16, 2018
0c543e3
Fixed comment issue
NicolasHug Jul 16, 2018
a0d86a0
Updated example
NicolasHug Jul 16, 2018
7b18937
fixed minor typos
NicolasHug Jul 16, 2018
800a2c2
Updated comment in whatsnew following ogrisel comments
NicolasHug Jul 16, 2018
53afa9f
Renamed plot example
NicolasHug Jul 16, 2018
a0d97ee
Should fix test for Python 2.7
NicolasHug Jul 17, 2018
23c3ddd
Should fix example plot
NicolasHug Jul 17, 2018
0c3b268
Addressed comments from TomDLT
NicolasHug Jul 17, 2018
7be0376
OPTIM implement fit_transform
ogrisel Jul 17, 2018
420476c
Added test fit_transform() == fit().transform()
NicolasHug Jul 17, 2018
1287f94
Added tests for the copy parameter
NicolasHug Jul 17, 2018
0ce4b36
Fixed flake8 issues in example plot
NicolasHug Jul 17, 2018
597a85d
set copy to False for the scaler
NicolasHug Jul 17, 2018
593c818
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
NicolasHug Jul 17, 2018
8022cc3
Addressed comments from glemaitre
NicolasHug Jul 17, 2018
0c120fb
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
NicolasHug Jul 18, 2018
8234a3e
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
NicolasHug Jul 19, 2018
7d529df
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
NicolasHug Jul 20, 2018
c0a01df
Updated plot_all_scaling.py example
NicolasHug Jul 20, 2018
doc/glossary.rst: 2 changes (1 addition, 1 deletion)
@@ -294,7 +294,7 @@ General Concepts
convergence of the training loss, to avoid over-fitting. This is
generally done by monitoring the generalization score on a validation
set. When available, it is activated through the parameter
``early_stopping`` or by setting a postive :term:`n_iter_no_change`.
``early_stopping`` or by setting a positive :term:`n_iter_no_change`.

estimator instance
We sometimes use this terminology to distinguish an :term:`estimator`
doc/modules/preprocessing.rst: 45 changes (30 additions, 15 deletions)
@@ -309,20 +309,34 @@ Power transforms are a family of parametric, monotonic transformations that aim
to map data from any distribution to as close to a Gaussian distribution as
possible in order to stabilize variance and minimize skewness.

:class:`PowerTransformer` currently provides one such power transformation,
the Box-Cox transform. The Box-Cox transform is given by:
:class:`PowerTransformer` currently provides two such power transformations,
the Yeo-Johnson transform and the Box-Cox transform.

The Yeo-Johnson transform is given by:

.. math::
y_i^{(\lambda)} =
x_i^{(\lambda)} =
\begin{cases}
\dfrac{y_i^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, \\[8pt]
\ln{(y_i)} & \text{if } \lambda = 0,
[(x_i + 1)^\lambda - 1] / \lambda & \text{if } \lambda \neq 0, x_i \geq 0, \\[8pt]
\ln{(x_i + 1)} & \text{if } \lambda = 0, x_i \geq 0 \\[8pt]
-[(-x_i + 1)^{2 - \lambda} - 1] / (2 - \lambda) & \text{if } \lambda \neq 2, x_i < 0, \\[8pt]
- \ln (- x_i + 1) & \text{if } \lambda = 2, x_i < 0
\end{cases}

Box-Cox can only be applied to strictly positive data. The transformation is
parameterized by :math:`\lambda`, which is determined through maximum likelihood
estimation. Here is an example of using Box-Cox to map samples drawn from a
lognormal distribution to a normal distribution::
while the Box-Cox transform is given by:

.. math::
x_i^{(\lambda)} =
\begin{cases}
\dfrac{x_i^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, \\[8pt]
\ln{(x_i)} & \text{if } \lambda = 0,
\end{cases}


Box-Cox can only be applied to strictly positive data. In both methods, the
transformation is parameterized by :math:`\lambda`, which is determined through
maximum likelihood estimation. Here is an example of using Box-Cox to map
samples drawn from a lognormal distribution to a normal distribution::

>>> pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)
>>> X_lognormal = np.random.RandomState(616).lognormal(size=(3, 3))
@@ -339,13 +353,14 @@ While the above example sets the `standardize` option to `False`,
:class:`PowerTransformer` will apply zero-mean, unit-variance normalization
to the transformed output by default.
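
To make the piecewise definition above concrete, here is a minimal NumPy
sketch of the Yeo-Johnson mapping for a single feature and a fixed lambda (the
``yeo_johnson`` helper below is illustrative only, not part of the
scikit-learn API)::

    import numpy as np

    def yeo_johnson(x, lmbda):
        """Yeo-Johnson transform, applied elementwise for a given lambda."""
        x = np.asarray(x, dtype=float)
        out = np.empty_like(x)
        pos = x >= 0
        if lmbda != 0:  # first branch of the positive half
            out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
        else:           # lambda == 0
            out[pos] = np.log(x[pos] + 1)
        if lmbda != 2:  # first branch of the negative half
            out[~pos] = -((-x[~pos] + 1) ** (2 - lmbda) - 1) / (2 - lmbda)
        else:           # lambda == 2
            out[~pos] = -np.log(-x[~pos] + 1)
        return out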

Below are examples of Box-Cox applied to various probability distributions.
Note that when applied to certain distributions, Box-Cox achieves very
Gaussian-like results, but with others, it is ineffective. This highlights
the importance of visualizing the data before and after transformation.
Below are examples of Box-Cox and Yeo-Johnson applied to various probability
distributions. Note that when applied to certain distributions, the power
transforms achieve very Gaussian-like results, but with others, they are
ineffective. This highlights the importance of visualizing the data before and
after transformation.

.. figure:: ../auto_examples/preprocessing/images/sphx_glr_plot_power_transformer_001.png
:target: ../auto_examples/preprocessing/plot_power_transformer.html
.. figure:: ../auto_examples/preprocessing/images/sphx_glr_plot_map_data_to_normal_001.png
:target: ../auto_examples/preprocessing/plot_map_data_to_normal.html
:align: center
:scale: 100

doc/whats_new/v0.20.rst: 13 changes (8 additions, 5 deletions)
Expand Up @@ -136,12 +136,15 @@ Preprocessing
DataFrames. :issue:`9012` by `Andreas Müller`_ and `Joris Van den Bossche`_,
and :issue:`11315` by :user:`Thomas Fan <thomasjpfan>`.

- Added :class:`preprocessing.PowerTransformer`, which implements the Box-Cox
power transformation, allowing users to map data from any distribution to a
Gaussian distribution. This is useful as a variance-stabilizing transformation
in situations where normality and homoscedasticity are desirable.
- Added :class:`preprocessing.PowerTransformer`, which implements the
Yeo-Johnson and Box-Cox power transformations. Power transformations try to
find a set of feature-wise parametric transformations to approximately map
data to a Gaussian distribution centered at zero and with unit variance.
This is useful as a variance-stabilizing transformation in situations where
normality and homoscedasticity are desirable.
:issue:`10210` by :user:`Eric Chang <ericchang00>` and
:user:`Maniteja Nandana <maniteja123>`.
:user:`Maniteja Nandana <maniteja123>`, and :issue:`11520` by :user:`Nicolas
Hug <nicolashug>`.
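
  For reference, a minimal usage sketch (the lognormal toy data below is
  illustrative only, not part of this changelog entry)::

    import numpy as np
    from sklearn.preprocessing import PowerTransformer

    rng = np.random.RandomState(0)
    X = rng.lognormal(size=(100, 1))
    pt = PowerTransformer(method='yeo-johnson')  # 'yeo-johnson' is the default
    X_gauss = pt.fit_transform(X)  # roughly Gaussian, zero mean, unit variance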

- Added the :class:`compose.TransformedTargetRegressor` which transforms
the target y before fitting a regression model. The predictions are mapped
examples/preprocessing/plot_all_scaling.py: 32 changes (17 additions, 15 deletions)
@@ -87,6 +87,8 @@
MaxAbsScaler().fit_transform(X)),
('Data after robust scaling',
RobustScaler(quantile_range=(25, 75)).fit_transform(X)),
('Data after power transformation (Yeo-Johnson)',
PowerTransformer(method='yeo-johnson').fit_transform(X)),
('Data after power transformation (Box-Cox)',
PowerTransformer(method='box-cox').fit_transform(X)),
('Data after quantile transformation (gaussian pdf)',
@@ -294,21 +296,21 @@ def make_plot(item_idx):
make_plot(4)

##############################################################################
# PowerTransformer (Box-Cox)
# --------------------------
# PowerTransformer
# ----------------
#
# ``PowerTransformer`` applies a power transformation to each
# feature to make the data more Gaussian-like. Currently,
# ``PowerTransformer`` implements the Box-Cox transform. The Box-Cox transform
# finds the optimal scaling factor to stabilize variance and minimize skewness
# through maximum likelihood estimation. By default, ``PowerTransformer`` also
# applies zero-mean, unit variance normalization to the transformed output.
# Note that Box-Cox can only be applied to positive, non-zero data. Income and
# number of households happen to be strictly positive, but if negative values
# are present, a constant can be added to each feature to shift it into the
# positive range - this is known as the two-parameter Box-Cox transform.
# ``PowerTransformer`` applies a power transformation to each feature to make
# the data more Gaussian-like. Currently, ``PowerTransformer`` implements the
# Yeo-Johnson and Box-Cox transforms. The power transform finds the optimal
# scaling factor to stabilize variance and minimize skewness through maximum
# likelihood estimation. By default, ``PowerTransformer`` also applies
# zero-mean, unit variance normalization to the transformed output. Note that
# Box-Cox can only be applied to strictly positive data. Income and number of
# households happen to be strictly positive, but if negative values are
# present, the Yeo-Johnson transform should be preferred.
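#
# For instance, a minimal sketch (not part of this example; ``X_neg`` is a
# hypothetical array containing a negative value)::
#
#   import numpy as np
#   from sklearn.preprocessing import PowerTransformer
#   X_neg = np.array([[-1.0], [0.5], [2.0]])
#   PowerTransformer(method='yeo-johnson').fit_transform(X_neg)  # works
#   PowerTransformer(method='box-cox').fit_transform(X_neg)  # raises ValueError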

make_plot(5)
make_plot(6)

##############################################################################
# QuantileTransformer (Gaussian output)
@@ -319,7 +321,7 @@ def make_plot(item_idx):
# Note that this non-parametric transformer introduces saturation artifacts
# for extreme values.

make_plot(6)
make_plot(7)

###################################################################
# QuantileTransformer (uniform output)
@@ -337,7 +339,7 @@ def make_plot(item_idx):
# any outliers by setting them to the a priori defined range boundaries (0 and
# 1).

make_plot(7)
make_plot(8)

##############################################################################
# Normalizer
@@ -350,6 +352,6 @@ def make_plot(item_idx):
# transformed data only lie in the positive quadrant. This would not be the
# case if some original features had a mix of positive and negative values.

make_plot(8)
make_plot(9)

plt.show()
examples/preprocessing/plot_map_data_to_normal.py: 137 changes (137 additions, 0 deletions)
@@ -0,0 +1,137 @@
"""
=================================
Map data to a normal distribution
=================================

This example demonstrates the use of the Box-Cox and Yeo-Johnson transforms
through :class:`preprocessing.PowerTransformer` to map data from various
distributions to a normal distribution.

The power transform is useful as a transformation in modeling problems where
homoscedasticity and normality are desired. Below are examples of Box-Cox and
Yeo-Johnson applied to six different probability distributions: Lognormal,
Chi-squared, Weibull, Gaussian, Uniform, and Bimodal.

Note that the transformations successfully map the data to a normal
distribution when applied to certain datasets, but are ineffective with others.
This highlights the importance of visualizing the data before and after
transformation.

Also note that even though Box-Cox seems to perform better than Yeo-Johnson for
lognormal and chi-squared distributions, keep in mind that Box-Cox does not
support inputs with negative values.

For comparison, we also add the output from
:class:`preprocessing.QuantileTransformer`. It can force an arbitrary
distribution into a Gaussian, provided that there are enough training samples
(thousands). Because it is a non-parametric method, it is harder to interpret
than the parametric ones (Box-Cox and Yeo-Johnson).

On "small" datasets (less than a few hundred points), the quantile transformer
is prone to overfitting. The use of the power transform is then recommended.
"""

# Author: Eric Chang <[email protected]>
# Nicolas Hug <[email protected]>
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import QuantileTransformer
from sklearn.model_selection import train_test_split

print(__doc__)


N_SAMPLES = 1000
FONT_SIZE = 6
BINS = 30


rng = np.random.RandomState(304)
bc = PowerTransformer(method='box-cox')
yj = PowerTransformer(method='yeo-johnson')
qt = QuantileTransformer(output_distribution='normal', random_state=rng)
size = (N_SAMPLES, 1)


# lognormal distribution
X_lognormal = rng.lognormal(size=size)

# chi-squared distribution
df = 3
X_chisq = rng.chisquare(df=df, size=size)

# weibull distribution
a = 50
X_weibull = rng.weibull(a=a, size=size)

# gaussian distribution
loc = 100
X_gaussian = rng.normal(loc=loc, size=size)

# uniform distribution
X_uniform = rng.uniform(low=0, high=1, size=size)

# bimodal distribution
loc_a, loc_b = 100, 105
X_a, X_b = rng.normal(loc=loc_a, size=size), rng.normal(loc=loc_b, size=size)
X_bimodal = np.concatenate([X_a, X_b], axis=0)


# create plots
distributions = [
    ('Lognormal', X_lognormal),
    ('Chi-squared', X_chisq),
    ('Weibull', X_weibull),
    ('Gaussian', X_gaussian),
    ('Uniform', X_uniform),
    ('Bimodal', X_bimodal)
]

colors = ['firebrick', 'darkorange', 'goldenrod',
          'seagreen', 'royalblue', 'darkorchid']

fig, axes = plt.subplots(nrows=8, ncols=3, figsize=plt.figaspect(2))
axes = axes.flatten()
axes_idxs = [(0, 3, 6, 9), (1, 4, 7, 10), (2, 5, 8, 11), (12, 15, 18, 21),
             (13, 16, 19, 22), (14, 17, 20, 23)]
axes_list = [(axes[i], axes[j], axes[k], axes[l])
             for (i, j, k, l) in axes_idxs]


for distribution, color, axes in zip(distributions, colors, axes_list):
    name, X = distribution
    X_train, X_test = train_test_split(X, test_size=.5)

    # perform power transforms and quantile transform
    X_trans_bc = bc.fit(X_train).transform(X_test)
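    # lambdas_ stores the maximum likelihood estimate of lambda for each
    # feature (a single column in this example)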
    lmbda_bc = round(bc.lambdas_[0], 2)
    X_trans_yj = yj.fit(X_train).transform(X_test)
    lmbda_yj = round(yj.lambdas_[0], 2)
    X_trans_qt = qt.fit(X_train).transform(X_test)

    ax_original, ax_bc, ax_yj, ax_qt = axes

    ax_original.hist(X_train, color=color, bins=BINS)
    ax_original.set_title(name, fontsize=FONT_SIZE)
    ax_original.tick_params(axis='both', which='major', labelsize=FONT_SIZE)

    for ax, X_trans, meth_name, lmbda in zip(
            (ax_bc, ax_yj, ax_qt),
            (X_trans_bc, X_trans_yj, X_trans_qt),
            ('Box-Cox', 'Yeo-Johnson', 'Quantile transform'),
            (lmbda_bc, lmbda_yj, None)):
        ax.hist(X_trans, color=color, bins=BINS)
        title = 'After {}'.format(meth_name)
        if lmbda is not None:
            title += '\n$\\lambda$ = {}'.format(lmbda)
        ax.set_title(title, fontsize=FONT_SIZE)
        ax.tick_params(axis='both', which='major', labelsize=FONT_SIZE)
        ax.set_xlim([-3.5, 3.5])


plt.tight_layout()
plt.show()