[MRG] ENH: Add boxcox transform to preprocess input data #6781
Conversation
sklearn/preprocessing/data.py
Outdated
    Royal Statistical Society B, 26, 211-252 (1964).
    """
    X = check_array(X, ensure_2d=True, dtype=FLOAT_DTYPES)
    if any(np.any(X<=0, axis=0)):
PEP8.
Can you also add a transformer class? It should accept as a parameter the indices of the features to transform.
Yeah sure. Can you please tell if the API should be on the lines of …

see if it has been done in other preprocessing estimators. I would call it …
sklearn/preprocessing/data.py
Outdated
                               self.categorical_features, copy=True)


def _boxcox(X):
Do we need this? Why can't we just do a stats.boxcox(X)[0] at the required place(s)?
I was hoping to use the Parallel helper here, and couldn't figure out how to take only a part of the input while using it. If that is overkill, I will make it a vectorized loop like everywhere else.
Ah I see you need to pass this for the parallel computation.
But I think you could pass the stats.boxcox as such and extract the desired output in one go? Would be cleaner? No strong opinion though.
Could you please show how to achieve that, since I couldn't find a way to extract the first part of the tuple output?
On 15 May 2016 7:29 pm, "Raghav R V" [email protected] wrote:

In sklearn/preprocessing/data.py, #6781 (comment):

@@ -1917,3 +1920,38 @@ def transform(self, X):
     """
     return _transform_selected(X, self._transform,
                                self.categorical_features, copy=True)
+
+
+def _boxcox(X):

> But I think you could pass the stats.boxcox as such and extract the desired output in one go? Would be cleaner? No strong opinion though.
keep it simple.
Indeed. @maniteja123 nvm... Leave it as such. (Just in case you are curious, you could zip(*output)[0])
Yeah sure. Thanks for the tip.
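(To make the tip concrete: a minimal sketch, assuming scipy is available, of collecting the (column, lambda) tuples that stats.boxcox returns per feature and regrouping them with zip. Note that zip(*output)[0] only works on Python 2; on Python 3, zip returns an iterator, so tuple unpacking as below is safer. The variable names are illustrative, not the PR's code.)

    import numpy as np
    from scipy import stats

    X = np.abs(np.random.randn(100, 3)) + 1.0  # strictly positive toy data

    # stats.boxcox(x) with no lambda returns (transformed_column, fitted_lambda).
    outputs = [stats.boxcox(X[:, i]) for i in range(X.shape[1])]

    # zip(*outputs) regroups the per-feature tuples: the first group holds
    # the transformed columns, the second the fitted lambdas.
    columns, lambdas = zip(*outputs)
    X_trans = np.column_stack(columns)
    lambdas = np.asarray(lambdas)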
There is a function …
sklearn/preprocessing/data.py
Outdated
    X = check_array(X, ensure_2d=True, dtype=FLOAT_DTYPES)
    if any(np.any(X<=0, axis=0)):
        raise ValueError("BoxCox transform can only be applied on positive data")
    n_samples, n_features = X.shape
_, n_features
ede4464 to f4f77c6
sklearn/preprocessing/data.py
Outdated
    if any(np.any(X<=0, axis=0)):
        raise ValueError("BoxCox transform can only be applied on positive data")
    n_samples, n_features = X.shape
    outputs = Parallel()(delayed(_boxcox)(X[:, i]) for i in range(n_features))
You need to specify the number of jobs to Parallel, otherwise it runs in series.
Thanks, will change this.
Thanks for the PR. You'll need a transformer that learns the parameters of the boxcox transform during fit, and that applies it to new data during transform. You'll also need an example. For instance, using the Boston Housing data, where you learn the transform on half of the data, and you apply it to the other half, and you show histograms of the features before and after the transform.
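(A minimal sketch of such a transformer, assuming scipy.stats.boxcox does the per-feature work; the class name and details are hypothetical, not the PR's final code.)

    import numpy as np
    from scipy import stats
    from sklearn.base import BaseEstimator, TransformerMixin

    class BoxCoxSketch(BaseEstimator, TransformerMixin):
        """Illustrative only: learn one lambda per feature during fit,
        reuse the stored lambdas on new data during transform."""

        def fit(self, X, y=None):
            X = np.asarray(X, dtype=float)
            # With lmbda=None, stats.boxcox returns (transformed, fitted_lambda);
            # only the lambda is kept here.
            self.lambdas_ = np.array(
                [stats.boxcox(X[:, i])[1] for i in range(X.shape[1])])
            return self

        def transform(self, X):
            X = np.asarray(X, dtype=float)
            # With an explicit lmbda, stats.boxcox returns just the array.
            return np.column_stack(
                [stats.boxcox(X[:, i], lmbda=lam)
                 for i, lam in enumerate(self.lambdas_)])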
I have one more question. I was using the …
You can pass lambda to the boxcox function of scipy.
Sorry if I wasn't clear in my question. I was wondering how to pass it to the …
Store the lambdas learned from fit as an attribute of the transformer.
Sorry for so many questions. But for the boxcox transform the input should be positive data, and the Boston dataset has some zero values. And also, could you please elaborate on "histograms of features"? Thanks.
> But for the boxcox transform, the input should be positive data but the Boston dataset has some zero values.

Good point. Could you apply it only on the features that don't have zeros?

> And also could you please elaborate on "histograms of features"?

My idea was to plot the marginal distribution for the features before and after transformation.
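(For illustration, the before/after histograms could be produced roughly like this; a sketch assuming load_boston, which was available in scikit-learn at the time but has since been removed, with an arbitrary choice of feature and bin count.)

    import matplotlib.pyplot as plt
    from scipy import stats
    from sklearn.datasets import load_boston  # removed in modern scikit-learn

    X = load_boston().data
    x = X[:, 7]  # DIS: strictly positive and right-skewed
    x_bc, lam = stats.boxcox(x)

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.hist(x, bins=30)
    ax1.set_title("before Box-Cox")
    ax2.hist(x_bc, bins=30)
    ax2.set_title("after Box-Cox (lambda=%.2f)" % lam)
    plt.show()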
Thanks, will try that. There are only two features with zero as a value. And IIUC, I should plot the marginal distribution for each of the 11 features, right?
It's really not what I was expecting. I was expecting the distribution after the Box-Cox to look more normal, like on https://www.isixsigma.com/tools-templates/normality/making-data-normal-using-box-cox-power-transformation/ In addition, you have a failing test: …
Sorry, I chose a bad feature. Would something like this be better, though the plot doesn't look exactly normal after the transform? I will push the code so that it is possible to check if I am doing it wrong. Also, the common tests are failing because …
Sorry for pinging again, but this was the best feature which showed some amount of transformation (at least visually). Would it be better to try shuffling the datasets and then do the transformation, or to try another dataset? Thanks.
sklearn/preprocessing/data.py
Outdated
        n_features = X.shape[1]
        outputs = Parallel(n_jobs=-1)(delayed(_boxcox)(X[:, i], None, True) for i in range(n_features))
        output = np.concatenate([o[0][..., np.newaxis] for o in outputs], axis=1)
        self.lambdas = np.array([o[1] for o in outputs])
Estimated attributes should end with _. So:

self.lambdas_ = np.array([o[1] for o in outputs])
sklearn/preprocessing/data.py
Outdated
-def _transform_selected(X, transform, selected="all", copy=True):
+def _transform_selected(X, transform, selected="all", copy=True, order=False):
In certain functions like check_array, order is a keyword argument which describes how the array is ordered.
I'd instead rename this parameter to something like retain_ordering.
Separate issue: now that other classes from modules outside of data.py are using this function, maybe we should create a utils.py and place transform_selected in it.
Might add that as a todo for a future pull request...
sklearn/preprocessing/data.py
Outdated
    X : array-like, shape (n_samples, n_features)
        The data to be transformed. Should contain only positive data.
    copy : boolean, optional, default is True
To retain consistency in this file's docs, I'd say default=True.
sklearn/preprocessing/data.py
Outdated
        The data to be transformed. Should contain only positive data.
    copy : boolean, optional, default is True
        set to False to perform inplace transformation and avoid a
Capitalize set.
    ----------
    X : array-like, shape [n_samples, n_features]
        The data to fit by apply boxcox transform,
        to each of the features and learn the lambda.
Please add
y : ignored
sklearn/preprocessing/data.py
Outdated
        self.lambdas_ = np.array([o[1] for o in out])
        return self


    def transform(self, X, y=None):
Please leave out y=None.
            # As of now, X is expected to be dense array
            X[:, ind[sel]] = X_sel
            return X
        else:
Nit: one of my pet peeves is inconsequential else statements. You can remove this else statement without any change in your logic. (It reduces one level of indentation.)
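(A trivial illustration of the nit, with hypothetical code: once the if branch returns, the else only adds indentation.)

    # Before: the else is inconsequential because the if branch returns.
    def sign_label(x):
        if x > 0:
            return "positive"
        else:
            return "non-positive"

    # After: same logic, one less indentation level.
    def sign_label(x):
        if x > 0:
            return "positive"
        return "non-positive"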
sklearn/preprocessing/data.py
Outdated
    Royal Statistical Society B, 26, 211-252 (1964).
    """
    X = check_array(X, ensure_2d=True, dtype=FLOAT_DTYPES, copy=copy)
    if any(np.any(X <= 0, axis=0)):
You can simply do np.any(X <= 0).
sklearn/preprocessing/data.py
Outdated
        if self.transformed_features_.dtype == np.bool:
            self.transformed_features_ = \
                np.where(self.transformed_features_)[0]
        if any(np.any(X[:, self.transformed_features_] <= 0, axis=0)):
Do np.any(X[:, self.transformed_features_] <= 0).
Thanks a lot @hlin117 for the review and suggestions. I have done most of them. I still need to benchmark the parallel call in fit; will ping once I incorporate that and all the pending changes. I will try to complete it by the weekend. Please let me know if there is any deadline, since it is marked under the 0.19 milestone. Thanks.
12245a1 to 6097c13
jnothman left a comment
Regarding parallelism:

- I really doubt transform will benefit. Optimized builds of numpy are already likely to perform the operation with some low-level parallelism.
- Parallel fit (and the function version) when lambda=None should be benchmarked. I suspect that if it provides any gain, it will be with backend='threading'; even then, it might not be the right solution. If the aim is to estimate lambda over large datasets, it may be more appropriate to estimate it on a sub-sample than to use multithreading (see the sketch below).
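(The sub-sampling idea from the last point could look roughly like this; a sketch with arbitrary sizes, not benchmarked code.)

    import numpy as np
    from scipy import stats

    rng = np.random.RandomState(0)
    x = rng.lognormal(size=1000000)  # one large, strictly positive feature

    # Estimate lambda on a random sub-sample instead of the full column...
    idx = rng.choice(x.shape[0], size=10000, replace=False)
    _, lam = stats.boxcox(x[idx])

    # ...then apply the fixed lambda to all the data: cheap and vectorized.
    x_bc = stats.boxcox(x, lmbda=lam)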
I'm ok with power transformation, though people might indeed be looking for BoxCox, which would then be harder to find? hm...
jnothman left a comment
Otherwise, I think this is looking good
    return x


def boxcox(X, copy=True):
Every time I look again at this PR, I think the function version is a bad idea. It will be misused and cause test data leakage.
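(The leakage concern, made concrete with scipy directly; hypothetical usage, but the same applies to any one-shot function version.)

    import numpy as np
    from scipy import stats

    rng = np.random.RandomState(0)
    x_train = rng.lognormal(size=100)
    x_test = rng.lognormal(size=100)

    # Leaky: lambda is re-estimated on the test data, so the test-set
    # transformation depends on the test set itself.
    x_test_leaky, _ = stats.boxcox(x_test)

    # Correct: estimate lambda on the training data only, then apply
    # that fixed lambda to the test data.
    _, lam = stats.boxcox(x_train)
    x_test_ok = stats.boxcox(x_test, lmbda=lam)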
class BoxCoxTransformer(BaseEstimator, TransformerMixin):
    """BoxCox transformation on individual features.
*Box-Cox
class BoxCoxTransformer(BaseEstimator, TransformerMixin):
    """BoxCox transformation on individual features.
    Boxcox transform wil be applied on each feature (each column of
*Box-Cox
        if np.any(X[:, self.transformed_features_] <= 0):
            raise ValueError("BoxCox transform can only be applied "
                             "on positive data")
        out = Parallel(n_jobs=self.n_jobs)(delayed(_boxcox)(X, i,
I'd rather remove Parallel in the first version unless it's benchmarked. In particular, I don't think a multiprocessing backend will provide benefits.
I agree. No Parallel.
-def _transform_selected(X, transform, selected="all", copy=True):
+def _transform_selected(X, transform, selected="all", copy=True,
+                        retain_ordering=False):
test_transform_selected must be updated.
        return X_tr


    def _transform(self, X):
        outputs = Parallel(n_jobs=self.n_jobs)(
This is really unlikely to help
I agree. I think we should avoid over-engineering and quickly provide a first working BoxCox API. This is really popular and an essential standard tool. If you don't want to call it BoxCox but power functions, at least the documentation should clearly make the link between the two, such that it can be found on Google. @maniteja123 Are you planning to work on this any time soon? I am at the sprint right now and might have some time to do a pass. I could send you a pull request or take over.
Hi @dengemann, sorry for the delayed response. Please feel free to work on this. Either giving a PR or taking over is fine with me. Please let me know whichever is comfortable. Thanks!
@maniteja123 my window of opportunity has passed; I won't be able to work on this in the next days. Please keep me pinged, I'm happy to help with this PR. Btw, we should also keep an eye on the Yeo-Johnson transformation, which can handle negative values as well. See also https://www.jstor.org/stable/2673623 and https://gist.github.com/mesgarpour/f24769cd186e2db853957b10ff6b7a95. cc @amueller @jnothman @ogrisel I'm wondering if this could even fit into one class, or maybe at least one module, power_transforms.
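(For reference, a minimal sketch of the Yeo-Johnson transform via scipy; note that stats.yeojohnson only appeared in scipy 1.2, well after this thread.)

    import numpy as np
    from scipy import stats  # stats.yeojohnson requires scipy >= 1.2

    x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])  # zeros and negatives are fine
    # Like boxcox, it returns (transformed, fitted_lambda) when lmbda is None.
    x_yj, lam = stats.yeojohnson(x)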
yes I've suggested as much somewhere above. thanks
@@ -0,0 +1,48 @@
"""
Perhaps a section in plot_all_scaling.py will now suffice instead of this.
@maniteja123 can I please open this to be finished by another contributor?
@maniteja123 @jnothman I recently implemented a BoxCoxTransformer for my personal use and would love to pick this up if you are open to it.
Yes, please do @ericchang00. I'm keen to see this merged.
@ericchang00 Please go ahead. @jnothman Really sorry for not completing this till now.
@jnothman @maniteja123 Sounds great. I'll open a new PR with this PR linked, about a week from now.
oh I'm being inconsistent with myself, aren't I? I would be fine to see a function form for consistency with other scaler-like things, particularly for when Yeo-Johnson is also available.
Haha, no worries! Sounds good to me.
Reference Issue
See #6675
What does this implement/fix? Explain your changes.
Implement the boxcox transform. The current approach is to apply it univariately to each feature, with lambda evaluated by maximising the log likelihood.
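(For reference, the per-feature mapping being applied is the Box-Cox family from Box & Cox, 1964, with lambda chosen per feature to maximise the log-likelihood:)

x^{(\lambda)} =
\begin{cases}
  \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
  \ln x, & \lambda = 0
\end{cases}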
Any other comments?
This is just my initial attempt and I am not sure if it is supposed to be computed in this manner. The documentation and tests need to be made better, but I first wanted to ask if this is the expected functionality. Please do give any suggestions on improving this. Thanks.