[MRG] ENH: Add boxcox transform to preprocess input data #6781
Conversation
sklearn/preprocessing/data.py
Outdated
    Royal Statistical Society B, 26, 211-252 (1964).
    """
    X = check_array(X, ensure_2d=True, dtype=FLOAT_DTYPES)
    if any(np.any(X<=0, axis=0)):
PEP8.
Can you also add a transformer class? It should accept as a parameter the indices of the features to transform.
Yeah sure. Can you please tell if the API should be on the lines of …

see if it has been done in other preprocessing estimators. I would call it …
sklearn/preprocessing/data.py
Outdated
                               self.categorical_features, copy=True)


def _boxcox(X):
Do we need this? Why can't we just do a stats.boxcox(X)[0] at the required place(s)?
I was hoping to use the Parallel helper here, and couldn't figure out how to take only a part of the input while using it. If that is overkill, I will make it a vectorized loop like everywhere else.
Ah I see you need to pass this for the parallel computation.
But I think you could pass the stats.boxcox as such and extract the desired output in one go? Would be cleaner? No strong opinion though.
Could you please show how to achieve that, since I couldn't find a way to extract the first part of the tuple output?
On 15 May 2016 7:29 pm, "Raghav R V" [email protected] wrote:

In sklearn/preprocessing/data.py, #6781 (comment):

@@ -1917,3 +1920,38 @@ def transform(self, X):
     """
     return _transform_selected(X, self._transform,
                                self.categorical_features, copy=True)
+
+
+def _boxcox(X):

> But I think you could pass the stats.boxcox as such and extract the desired output in one go? Would be cleaner? No strong opinion though.
keep it simple.
Indeed. @maniteja123 nvm... Leave it as such. (Just in case you are curious, you could zip(*output)[0])
Yeah sure. Thanks for the tip.
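(To make the tip concrete: a minimal sketch, assuming scipy is available, of collecting the (column, lambda) tuples that stats.boxcox returns per feature and regrouping them with zip. Note that zip(*output)[0] only works on Python 2; on Python 3, zip returns an iterator, so tuple unpacking as below is safer. The variable names are illustrative, not the PR's code.)

    import numpy as np
    from scipy import stats

    X = np.abs(np.random.randn(100, 3)) + 1.0  # strictly positive toy data

    # stats.boxcox(x) with no lambda returns (transformed_column, fitted_lambda).
    outputs = [stats.boxcox(X[:, i]) for i in range(X.shape[1])]

    # zip(*outputs) regroups the per-feature tuples: the first group holds
    # the transformed columns, the second the fitted lambdas.
    columns, lambdas = zip(*outputs)
    X_trans = np.column_stack(columns)
    lambdas = np.asarray(lambdas)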
There is a function …
sklearn/preprocessing/data.py
Outdated
    X = check_array(X, ensure_2d=True, dtype=FLOAT_DTYPES)
    if any(np.any(X<=0, axis=0)):
        raise ValueError("BoxCox transform can only be applied on positive data")
    n_samples, n_features = X.shape
_, n_features
ede4464 to f4f77c6
sklearn/preprocessing/data.py
Outdated
    if any(np.any(X<=0, axis=0)):
        raise ValueError("BoxCox transform can only be applied on positive data")
    n_samples, n_features = X.shape
    outputs = Parallel()(delayed(_boxcox)(X[:, i]) for i in range(n_features))
You need to specify the number of jobs to Parallel, otherwise it runs in series.
Thanks, will change this.
Thanks for the PR. You'll need a transformer that learns the parameters of the boxcox transform during fit, and that applies it to new data during transform. You'll also need an example. For instance, using the Boston Housing data, where you learn the transform on half of the data, and you apply it to the other half, and you show histograms of the features before and after the transform.
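(A minimal sketch of such a transformer, assuming scipy.stats.boxcox does the per-feature work; the class name and details are hypothetical, not the PR's final code.)

    import numpy as np
    from scipy import stats
    from sklearn.base import BaseEstimator, TransformerMixin

    class BoxCoxSketch(BaseEstimator, TransformerMixin):
        """Illustrative only: learn one lambda per feature during fit,
        reuse the stored lambdas on new data during transform."""

        def fit(self, X, y=None):
            X = np.asarray(X, dtype=float)
            # With lmbda=None, stats.boxcox returns (transformed, fitted_lambda);
            # only the lambda is kept here.
            self.lambdas_ = np.array(
                [stats.boxcox(X[:, i])[1] for i in range(X.shape[1])])
            return self

        def transform(self, X):
            X = np.asarray(X, dtype=float)
            # With an explicit lmbda, stats.boxcox returns just the array.
            return np.column_stack(
                [stats.boxcox(X[:, i], lmbda=lam)
                 for i, lam in enumerate(self.lambdas_)])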
I have one more question. I was using the …
You can pass lambda to the boxcox function of scipy.
Sorry if I wasn't clear in my question. I was wondering how to pass it to the …
Store the lambdas learned from fit as an attribute of the transformer.
Sorry for so many questions. But for the boxcox transform the input should be positive data, and the Boston dataset has some zero values. And also, could you please elaborate on "histograms of features"? Thanks.
> But for the boxcox transform, the input should be positive data but the Boston dataset has some zero values.

Good point. Could you apply it only on the features that don't have zeros?

> And also could you please elaborate on "histograms of features"?

My idea was to plot the marginal distribution for the features before and after transformation.
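(For illustration, the before/after histograms could be produced roughly like this; a sketch assuming load_boston, which was available in scikit-learn at the time but has since been removed, with an arbitrary choice of feature and bin count.)

    import matplotlib.pyplot as plt
    from scipy import stats
    from sklearn.datasets import load_boston  # removed in modern scikit-learn

    X = load_boston().data
    x = X[:, 7]  # DIS: strictly positive and right-skewed
    x_bc, lam = stats.boxcox(x)

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.hist(x, bins=30)
    ax1.set_title("before Box-Cox")
    ax2.hist(x_bc, bins=30)
    ax2.set_title("after Box-Cox (lambda=%.2f)" % lam)
    plt.show()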
Thanks, will try that. There are only two features with zero as a value. And IIUC, I should plot the marginal distribution for each of the 11 features, right?
It's really not what I was expecting. I was expecting the distribution after the Box-Cox to look more normal, like on https://www.isixsigma.com/tools-templates/normality/making-data-normal-using-box-cox-power-transformation/ In addition, you have a failing test: …
Sorry, I chose a bad feature. Would something like this be better, though the plot doesn't look exactly normal after the transform? I will push the code so that it is possible to check if I am doing it wrong. Also, the common tests are failing because …
Sorry for pinging again, but this was the best feature which showed some amount of transformation (at least visually). Would it be better to try shuffling the datasets and then do the transformation, or to try another dataset? Thanks.
sklearn/preprocessing/data.py
Outdated
        n_features = X.shape[1]
        outputs = Parallel(n_jobs=-1)(delayed(_boxcox)(X[:, i], None, True) for i in range(n_features))
        output = np.concatenate([o[0][..., np.newaxis] for o in outputs], axis=1)
        self.lambdas = np.array([o[1] for o in outputs])
Estimated attributes should end with _. So:

self.lambdas_ = np.array([o[1] for o in outputs])
sklearn/preprocessing/data.py
Outdated
-def _transform_selected(X, transform, selected="all", copy=True):
+def _transform_selected(X, transform, selected="all", copy=True, order=False):
In certain functions like check_array, order is a keyword argument which describes how the array is ordered.
I'd instead rename this parameter to something like retain_ordering.
Separate issue: now that other classes from modules outside of data.py are using this function, maybe we should create a utils.py and place transform_selected in it.
Might add that as a todo for a future pull request...
sklearn/preprocessing/data.py
Outdated
    X : array-like, shape (n_samples, n_features)
        The data to be transformed. Should contain only positive data.
    copy : boolean, optional, default is True
To retain consistency in this file's docs, I'd say default=True.
sklearn/preprocessing/data.py
Outdated
        The data to be transformed. Should contain only positive data.
    copy : boolean, optional, default is True
        set to False to perform inplace transformation and avoid a
Capitalize set.
    ----------
    X : array-like, shape [n_samples, n_features]
        The data to fit by apply boxcox transform,
        to each of the features and learn the lambda.
Please add
y : ignored
sklearn/preprocessing/data.py
Outdated
        self.lambdas_ = np.array([o[1] for o in out])
        return self


    def transform(self, X, y=None):
Please leave out y=None.
            # As of now, X is expected to be dense array
            X[:, ind[sel]] = X_sel
            return X
        else:
Nit: one of my pet peeves is inconsequential else statements. You can remove this else statement without any change in your logic. (It reduces one level of indentation.)
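(A trivial illustration of the nit, with hypothetical code: once the if branch returns, the else only adds indentation.)

    # Before: the else is inconsequential because the if branch returns.
    def sign_label(x):
        if x > 0:
            return "positive"
        else:
            return "non-positive"

    # After: same logic, one less indentation level.
    def sign_label(x):
        if x > 0:
            return "positive"
        return "non-positive"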
sklearn/preprocessing/data.py
Outdated
    Royal Statistical Society B, 26, 211-252 (1964).
    """
    X = check_array(X, ensure_2d=True, dtype=FLOAT_DTYPES, copy=copy)
    if any(np.any(X <= 0, axis=0)):
You can simply do np.any(X <= 0).
sklearn/preprocessing/data.py
Outdated
        if self.transformed_features_.dtype == np.bool:
            self.transformed_features_ = \
                np.where(self.transformed_features_)[0]
        if any(np.any(X[:, self.transformed_features_] <= 0, axis=0)):
Do np.any(X[:, self.transformed_features_] <= 0).
Thanks a lot @hlin117 for the review and suggestions. I have done most of them. I still need to benchmark the parallel call in fit; will ping once I incorporate that and all the pending changes. I will try to complete it by the weekend. Please let me know if there is any deadline, since it is marked under the 0.19 milestone. Thanks.
12245a1 to 6097c13
jnothman left a comment
Regarding parallelism:

- I really doubt transform will benefit. Optimized builds of numpy are already likely to perform the operation with some low-level parallelism.
- Parallel fit (and the function version) when lambda=None should be benchmarked. I suspect that if it provides any gain, it will be with backend='threading'; even then, it might not be the right solution. If the aim is to estimate lambda over large datasets, it may be more appropriate to estimate it on a sub-sample than to use multithreading (see the sketch below).
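(The sub-sampling idea from the last point could look roughly like this; a sketch with arbitrary sizes, not benchmarked code.)

    import numpy as np
    from scipy import stats

    rng = np.random.RandomState(0)
    x = rng.lognormal(size=1000000)  # one large, strictly positive feature

    # Estimate lambda on a random sub-sample instead of the full column...
    idx = rng.choice(x.shape[0], size=10000, replace=False)
    _, lam = stats.boxcox(x[idx])

    # ...then apply the fixed lambda to all the data: cheap and vectorized.
    x_bc = stats.boxcox(x, lmbda=lam)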
I'm ok with power transformation, though people might indeed be looking for BoxCox, which would then be harder to find? hm...
jnothman left a comment
Otherwise, I think this is looking good
    return x


def boxcox(X, copy=True):
Every time I look again at this PR, I think the function version is a bad idea. It will be misused and cause test data leakage.
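(The leakage concern, made concrete with scipy directly; hypothetical usage, but the same applies to any one-shot function version.)

    import numpy as np
    from scipy import stats

    rng = np.random.RandomState(0)
    x_train = rng.lognormal(size=100)
    x_test = rng.lognormal(size=100)

    # Leaky: lambda is re-estimated on the test data, so the test-set
    # transformation depends on the test set itself.
    x_test_leaky, _ = stats.boxcox(x_test)

    # Correct: estimate lambda on the training data only, then apply
    # that fixed lambda to the test data.
    _, lam = stats.boxcox(x_train)
    x_test_ok = stats.boxcox(x_test, lmbda=lam)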
class BoxCoxTransformer(BaseEstimator, TransformerMixin):
    """BoxCox transformation on individual features.
*Box-Cox
class BoxCoxTransformer(BaseEstimator, TransformerMixin):
    """BoxCox transformation on individual features.
    Boxcox transform wil be applied on each feature (each column of
*Box-Cox
        if np.any(X[:, self.transformed_features_] <= 0):
            raise ValueError("BoxCox transform can only be applied "
                             "on positive data")
        out = Parallel(n_jobs=self.n_jobs)(delayed(_boxcox)(X, i,
I'd rather remove Parallel in the first version unless it's benchmarked. In particular, I don't think a multiprocessing backend will provide benefits.
I agree. No Parallel.
-def _transform_selected(X, transform, selected="all", copy=True):
+def _transform_selected(X, transform, selected="all", copy=True,
+                        retain_ordering=False):
test_transform_selected must be updated.
        return X_tr


    def _transform(self, X):
        outputs = Parallel(n_jobs=self.n_jobs)(
This is really unlikely to help
I agree. I think we should avoid over-engineering and quickly provide a first working BoxCox API. This is really popular and an essential standard tool. If you don't want to call it BoxCox but power functions, at least the documentation should clearly make the link between the two, such that it can be found on Google. @maniteja123 Are you planning to work on this any time soon? I am at the sprint right now and might have some time to do a pass. I could send you a pull request or take over.
Hi @dengemann, sorry for the delayed response. Please feel free to work on this. Either giving a PR or taking over is fine with me. Please let me know whichever is comfortable. Thanks!
@maniteja123 my window of opportunity has passed; I won't be able to work on this in the next days. Please keep me pinged, I'm happy to help with this PR. Btw, we should also keep an eye on the Yeo-Johnson transformation, which can handle negative values as well. See also https://www.jstor.org/stable/2673623 and https://gist.github.com/mesgarpour/f24769cd186e2db853957b10ff6b7a95. cc @amueller @jnothman @ogrisel I'm wondering if this could even fit into one class, or maybe at least one module, power_transforms.
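(For reference, a minimal sketch of the Yeo-Johnson transform via scipy; note that stats.yeojohnson only appeared in scipy 1.2, well after this thread.)

    import numpy as np
    from scipy import stats  # stats.yeojohnson requires scipy >= 1.2

    x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])  # zeros and negatives are fine
    # Like boxcox, it returns (transformed, fitted_lambda) when lmbda is None.
    x_yj, lam = stats.yeojohnson(x)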
yes I've suggested as much somewhere above. thanks
@@ -0,0 +1,48 @@
"""
Perhaps a section in plot_all_scaling.py will now suffice instead of this.
@maniteja123 can I please open this to be finished by another contributor?
@maniteja123 @jnothman I recently implemented a BoxCoxTransformer for my personal use and would love to pick this up if you are open to it.
Yes, please do @ericchang00. I'm keen to see this merged.
@ericchang00 Please go ahead. @jnothman Really sorry for not completing this till now.
@jnothman @maniteja123 Sounds great. I'll open a new PR with this PR linked, about a week from now.
oh I'm being inconsistent with myself, aren't I? I would be fine to see a function form for consistency with other scaler-like things, particularly for when Yeo-Johnson is also available.
Haha, no worries! Sounds good to me.
Reference Issue
See #6675
What does this implement/fix? Explain your changes.
Implement the boxcox transform. The current approach is to apply it univariately to each feature, with lambda evaluated by maximising the log likelihood.
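(For reference, the per-feature mapping being applied is the Box-Cox family from Box & Cox, 1964, with lambda chosen per feature to maximise the log-likelihood:)

x^{(\lambda)} =
\begin{cases}
  \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
  \ln x, & \lambda = 0
\end{cases}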
Any other comments?
This is just my initial attempt and I am not sure if it is supposed to be computed in this manner. The documentation and tests need to be made better, but I first wanted to ask if this is the expected functionality. Please do give any suggestions on improving this. Thanks.