
Conversation

@maniteja123
Contributor

@maniteja123 maniteja123 commented May 14, 2016

Reference Issue

See #6675

What does this implement/fix? Explain your changes.

Implement the Box-Cox transform. The current approach applies the univariate transform to each feature, with lambda evaluated by maximising the log-likelihood.

Any other comments?

This is just my initial attempt, and I am not sure if it is supposed to be computed in this manner. The documentation and tests need to be improved, but I first wanted to ask if this is the expected functionality. Please do give any suggestions on improving this. Thanks.
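The per-feature approach described above can be sketched with scipy (a minimal sketch, not the PR's actual code; the function name is hypothetical):

```python
import numpy as np
from scipy import stats

def boxcox_each_feature(X):
    """Apply a univariate Box-Cox transform to each column of X.

    scipy.stats.boxcox, called without a lambda, estimates the lambda
    that maximises the log-likelihood and returns the transformed
    column together with that lambda.
    """
    X = np.asarray(X, dtype=float)
    if np.any(X <= 0):
        raise ValueError("Box-Cox requires strictly positive data")
    cols, lambdas = [], []
    for j in range(X.shape[1]):
        xt, lmbda = stats.boxcox(X[:, j])
        cols.append(xt)
        lambdas.append(lmbda)
    return np.column_stack(cols), np.array(lambdas)
```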



Royal Statistical Society B, 26, 211-252 (1964).
"""
X = check_array(X, ensure_2d=True, dtype=FLOAT_DTYPES)
if any(np.any(X<=0, axis=0)):
Member

PEP8: add spaces around `<=`.

@agramfort
Member

Can you also add a transformer class? It should accept as param the indices of the features to transform

@maniteja123
Contributor Author

Yeah sure. Can you please tell me if the API should be along the lines of Normalizer? And what should be the name of the parameter specifying the features to be transformed? Also, should I add the option to provide the features to the boxcox_transform function? Sorry for so many doubts, but I wanted to make sure I am proceeding in the right direction. Thanks.

@agramfort
Member

Yeah sure. Can you please tell if the API should be on the lines of
Normalizer ?

yes, with just a copy param.

And what should be the name of parameter specifying the features to be
transformed? Also should I also add the option to provide the features to
function boxcox_transform. Sorry for so many doubts but wanted to make
sure I am proceeding in the right direction. Thanks.

see if it has been done in other preprocessing estimators. I would call it
feature_indices=None | array-like of int

self.categorical_features, copy=True)


def _boxcox(X):
Member

Do we need this? Why can't we just do a stats.boxcox(X)[0] at the required place(s)?

Contributor Author

@maniteja123 maniteja123 May 15, 2016


I was hoping to use the Parallel helper here, and couldn't figure out how to take only part of the input while using it. If that is overkill, I will make it a vectorized loop like everywhere else.

Member

Ah I see you need to pass this for the parallel computation.

Member

But I think you could pass the stats.boxcox as such and extract the desired output in one go? Would be cleaner? No strong opinion though.


Contributor Author

Could you please show how to achieve that? I couldn't find a way to extract the first part of the tuple output.

Member

@raghavrv raghavrv May 15, 2016

keep it simple.

Indeed. @maniteja123 nvm... Leave it as such. (Just in case you are curious, you could do zip(*output)[0].)
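For context, the tip above unpacks a list of (transformed_column, lambda) tuples such as scipy.stats.boxcox returns; note that on Python 3, zip returns an iterator, so indexing it requires materializing a list first (an illustration with dummy values, not the PR's code):

```python
# Hypothetical outputs: one (transformed_column, lambda) pair per feature.
outputs = [([1.0, 2.0], 0.5), ([3.0, 4.0], 1.2)]

# zip(*outputs)[0] worked on Python 2; on Python 3, materialize first:
columns, lambdas = zip(*outputs)
first_parts = list(zip(*outputs))[0]
```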

Contributor Author

Yeah sure. Thanks for the tip.

@maniteja123
Contributor Author

There is a function _transform_selected which is used in OneHotEncoder to apply a transform on selected features, but from what I can understand, it transforms those features and stacks them with the untransformed features. But the relative positions of the features must be preserved, right?

X = check_array(X, ensure_2d=True, dtype=FLOAT_DTYPES)
if any(np.any(X<=0, axis=0)):
raise ValueError("BoxCox transform can only be applied on positive data")
n_samples, n_features = X.shape
Member

_, n_features

if any(np.any(X<=0, axis=0)):
raise ValueError("BoxCox transform can only be applied on positive data")
n_samples, n_features = X.shape
outputs = Parallel()(delayed(_boxcox)(X[:, i]) for i in range(n_features))
Member

You need to specify the number of jobs to Parallel; otherwise it runs in series.

Contributor Author

Thanks will change this.

@GaelVaroquaux
Member

Thanks for the PR. You'll need a transformer that learns the parameters of the boxcox transform during fit, and that applies it to new data during transform.

You'll also need an example. For instance, using the Boston Housing data, where you learn the transform on half of the data, and you apply it to the other half, and you show histograms of the features before and after the transform.
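A minimal sketch of the fit/transform split described here, assuming scipy's boxcox for both the lambda estimation and the fixed-lambda transform (the class name is hypothetical, not the PR's code):

```python
import numpy as np
from scipy import stats
from sklearn.base import BaseEstimator, TransformerMixin

class BoxCoxSketch(BaseEstimator, TransformerMixin):
    """fit() estimates one lambda per feature by maximum likelihood;
    transform() applies those learned lambdas to new data."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        if np.any(X <= 0):
            raise ValueError("Box-Cox requires strictly positive data")
        # boxcox(x) with no lambda returns (transformed, fitted_lambda)
        self.lambdas_ = np.array(
            [stats.boxcox(X[:, j])[1] for j in range(X.shape[1])])
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # boxcox(x, lmbda=value) returns only the transformed array
        return np.column_stack(
            [stats.boxcox(X[:, j], lmbda=self.lambdas_[j])
             for j in range(X.shape[1])])
```

Learning the transform on half of the data and applying it to the other half is then just fit on the first half followed by transform on the second.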

@maniteja123
Contributor Author

I have one more question. I was using _transform_selected to perform the transform on selected features. But if we have to use the lambdas learnt from fit, this helper function can't be used, because it doesn't provide the option to pass any additional parameters to the transform function. Would it be okay to replicate the code of the boxcox function in the BoxCoxTransformer so that it can access the lambdas learnt from fit?

@GaelVaroquaux
Member

GaelVaroquaux commented May 16, 2016 via email

@maniteja123
Contributor Author

Sorry if I wasn't clear in my question. I was wondering how to pass it to the boxcox function here since it is a separate function and can't access the lambdas learnt from fit.

@GaelVaroquaux
Member

GaelVaroquaux commented May 16, 2016 via email

@maniteja123
Contributor Author

Sorry for so many questions. But for the Box-Cox transform the input should be positive, and the Boston dataset has some zero values. Also, could you please elaborate on "histograms of features"? Thanks.

@GaelVaroquaux
Member

GaelVaroquaux commented May 16, 2016 via email

@maniteja123
Contributor Author

Thanks, will try that. There are only two features with zero as a value. And IIUC, I should plot the marginal distribution for each of the 11 features, right?

@maniteja123
Contributor Author

I have tried plotting it for feature 1. Please do have a look and let me know if it is the expected plot. Thanks.
figure_1

@GaelVaroquaux
Member

It's really not what I was expecting. I was expecting the distribution after the box cox to look more normal, like on https://www.isixsigma.com/tools-templates/normality/making-data-normal-using-box-cox-power-transformation/

In addition, you have a failing test:
https://travis-ci.org/scikit-learn/scikit-learn/jobs/130406967

@maniteja123
Contributor Author

Sorry, I chose a bad feature. Would something like this be better, though the plot doesn't look exactly normal after the transform? I will push the code so it is possible to check whether I am doing it wrong. Also, the common tests are failing because test_non_meta_estimators is giving data with non-positive values. What should be done here? Thanks

figure_10

@maniteja123
Contributor Author

Sorry for pinging again, but this was the best feature that showed some amount of transformation (at least visually). Would it be better to try shuffling the dataset and then doing the transformation, or to try another dataset? Thanks.

n_features = X.shape[1]
outputs = Parallel(n_jobs=-1)(delayed(_boxcox)(X[:, i], None, True) for i in range(n_features))
output = np.concatenate([o[0][..., np.newaxis] for o in outputs], axis=1)
self.lambdas = np.array([o[1] for o in outputs])
Member

Estimated attributes should end with _. So:

self.lambdas_ = np.array([o[1] for o in outputs])



def _transform_selected(X, transform, selected="all", copy=True):
def _transform_selected(X, transform, selected="all", copy=True, order=False):
Contributor

In certain functions like check_array, order is a keyword argument which describes how the array is ordered.

I'd instead rename this parameter to something like retain_ordering.

Contributor

Separate issue: now that other classes from modules outside of data.py are using this function, maybe we should create a utils.py and place transform_selected in it.

Might add that as a todo for a future pull request...

X : array-like, shape (n_samples, n_features)
The data to be transformed. Should contain only positive data.
copy : boolean, optional, default is True
Contributor

To retain consistency in this file's docs, I'd say default=True.

The data to be transformed. Should contain only positive data.
copy : boolean, optional, default is True
set to False to perform inplace transformation and avoid a
Contributor

Capitalize set.

----------
X : array-like, shape [n_samples, n_features]
The data to fit by apply boxcox transform,
to each of the features and learn the lambda.
Contributor

Please add

y : ignored

self.lambdas_ = np.array([o[1] for o in out])
return self

def transform(self, X, y=None):
Contributor

Please leave out y=None.

# As of now, X is expected to be dense array
X[:, ind[sel]] = X_sel
return X
else:
Contributor

Nit: one of my pet peeves is inconsequential else statements. You can remove this else statement without any change in your logic. (It reduces one level of indentation.)

Royal Statistical Society B, 26, 211-252 (1964).
"""
X = check_array(X, ensure_2d=True, dtype=FLOAT_DTYPES, copy=copy)
if any(np.any(X <= 0, axis=0)):
Contributor

You can simply do np.any(X <= 0).

if self.transformed_features_.dtype == np.bool:
self.transformed_features_ = \
np.where(self.transformed_features_)[0]
if any(np.any(X[:, self.transformed_features_] <= 0, axis=0)):
Contributor

Do np.any(X[:, self.transformed_features_] <= 0).

@maniteja123
Contributor Author

Thanks a lot @hlin117 for the review and suggestions. I have done most of them. I still need to benchmark the parallel call in fit; will ping once I incorporate that and all the pending changes. I will try to complete it by the weekend. Please let me know if there is any deadline, since it is marked under the 0.19 milestone. Thanks.

Member

@jnothman jnothman left a comment

Regarding parallelism:

  • I really doubt transform will benefit. Optimized instances of numpy are already likely to perform the operation with some low-level parallelism.
  • Parallel fit (and the function version) when lambda=None should be benchmarked. I suspect that if it provides any gain, it will be with backend='threading'.
  • Even then, it might not be the right solution. If the aim is to estimate lambda, over large datasets it may be more appropriate to estimate it over a sub-sample than to use multithreading.
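The sub-sampling idea in the last bullet can be sketched as follows (function name and defaults are hypothetical; scipy.stats.boxcox_normmax does the maximum-likelihood lambda search):

```python
import numpy as np
from scipy import stats

def boxcox_lambda_subsample(x, max_samples=10000, random_state=0):
    """Estimate the Box-Cox lambda on a random subsample of x,
    avoiding a full pass (or multithreading) on very large arrays."""
    x = np.asarray(x, dtype=float)
    rng = np.random.RandomState(random_state)
    if x.shape[0] > max_samples:
        x = rng.choice(x, size=max_samples, replace=False)
    return stats.boxcox_normmax(x, method='mle')
```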

@amueller
Member

I'm ok with power transformation, though people might indeed be looking for BoxCox which would then be harder to find? hm...

Member

@jnothman jnothman left a comment

Otherwise, I think this is looking good

return x


def boxcox(X, copy=True):
Member

Every time I look again at this PR, I think the function version is a bad idea. It will be misused and cause test-data leakage.



class BoxCoxTransformer(BaseEstimator, TransformerMixin):
"""BoxCox transformation on individual features.
Member

*Box-Cox

class BoxCoxTransformer(BaseEstimator, TransformerMixin):
"""BoxCox transformation on individual features.
Boxcox transform wil be applied on each feature (each column of
Member

*Box-Cox

if np.any(X[:, self.transformed_features_] <= 0):
raise ValueError("BoxCox transform can only be applied "
"on positive data")
out = Parallel(n_jobs=self.n_jobs)(delayed(_boxcox)(X, i,
Member

I'd rather remove Parallel in the first version unless it's benchmarked. In particular, I don't think a multiprocessing backend will provide benefits.

Contributor

I agree. No Parallel.


def _transform_selected(X, transform, selected="all", copy=True):
def _transform_selected(X, transform, selected="all", copy=True,
retain_ordering=False):
Member

test_transform_selected must be updated.

return X_tr

def _transform(self, X):
outputs = Parallel(n_jobs=self.n_jobs)(
Member

This is really unlikely to help

@dengemann
Contributor

Regarding parallelism:

I really doubt transform will benefit. Optimized instances of numpy are already likely to perform the operation with some low-level parallelism.
Parallel fit (and the function version) when lambda=None should be benchmarked. I suspect if it provides any gain, it will be with backend='threading'.
Even then, it might not be the right solution. If the aim is to estimate lambda, over large datasets it may be more appropriate to estimate it over a sub-sample than use multithreading.

I agree. I think we should avoid over-engineering and quickly provide a first working BoxCox API. This is really popular and an essential standard tool. If you don't want to call it BoxCox but power functions, at least the documentation should clearly make the link between the two, so that it can be found on Google. @maniteja123 Are you planning to work on this any time soon? I am at the sprint right now and might have some time to do a pass. I could send you a pull request or take over.

@maniteja123
Contributor Author

Hi @dengemann, sorry for the delayed response. Please feel free to work on this. Either sending a PR or taking over is fine with me. Please let me know whichever is comfortable. Thanks!

@dengemann
Contributor

@maniteja123 my window of opportunity has passed; I won't be able to work on this in the next few days. Please keep me pinged. I'm happy to help with this PR.

Btw. we should also keep an eye on the Yeo-Johnson transformation, which can handle negative values as well. See also https://www.jstor.org/stable/2673623 and https://gist.github.com/mesgarpour/f24769cd186e2db853957b10ff6b7a95. cc @amueller @jnothman @ogrisel I'm wondering if this could even fit into one class, or maybe at least one module, power_transforms.
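For reference, the Yeo-Johnson transform mentioned here extends Box-Cox to zero and negative values; a sketch of the formula from Yeo & Johnson (2000), for a fixed lambda:

```python
import numpy as np

def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform for a fixed lambda.

    For x >= 0 it behaves like Box-Cox on x + 1; for x < 0 it applies
    the mirrored transform with parameter 2 - lambda.
    """
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    if lmbda != 0:
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
    else:
        out[pos] = np.log1p(x[pos])
    if lmbda != 2:
        out[~pos] = -((1 - x[~pos]) ** (2 - lmbda) - 1) / (2 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out
```

A quick sanity check: at lambda = 1 the transform reduces to the identity on both branches.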

@jnothman
Member

jnothman commented Jun 10, 2017 via email

@jnothman jnothman modified the milestones: 0.20, 0.19 Jun 13, 2017
@@ -0,0 +1,48 @@
"""
Member

Perhaps a section in plot_all_scaling.py will now suffice instead of this.

@jnothman
Member

@maniteja123 can I please open this up to be finished by another contributor?

@chang
Contributor

chang commented Nov 18, 2017

@maniteja123 @jnothman I recently implemented a BoxCoxTransformer for my personal use - would love to pick this up if you are open to it.

@jnothman
Member

Yes, please do @ericchang00. I'm keen to see this merged.

@maniteja123
Contributor Author

@ericchang00 Please go ahead.

@jnothman Really sorry for not completing this till now.

@chang
Contributor

chang commented Nov 21, 2017

@jnothman @maniteja123 Sounds great. I'll open a new PR with this PR linked, about a week from now.

@jnothman
Member

jnothman commented Dec 1, 2017 via email

@chang
Contributor

chang commented Dec 1, 2017

Haha no worries! Sounds good to me
