[WIP] CLN Encoder refactor #14972

Conversation
Does it mean that the user would get one warning for each column?

This refactor was done in a way to have the caller do the warning handling. The pattern is to change an attribute in

The public interface is exactly the same as we have on master. On a side note, I considered having the

I am hoping this makes it much easier to implement encoders or at least try out different encoders. Most of
Do we get any regression regarding the performance?
sklearn/preprocessing/_encoders.py
Outdated

```diff
@@ -52,103 +61,112 @@ def _check_X(self, X):
         # to keep the dtype information to be used in the encoder.
         needs_validation = True

-        n_samples, n_features = X.shape
+        n_features = X.shape[1]
```
You can remove this variable
sklearn/preprocessing/_encoders.py
Outdated

```diff
@@ -52,103 +61,112 @@ def _check_X(self, X):
         # to keep the dtype information to be used in the encoder.
         needs_validation = True

-        n_samples, n_features = X.shape
+        n_features = X.shape[1]
+        X_columns = []

         for i in range(n_features):
```
`n_features` => `X.shape[1]`
sklearn/preprocessing/_encoders.py
Outdated

```diff
-        X_mask = np.ones((n_samples, n_features), dtype=np.bool)
+    def _fit_list(self, X_list, encoders):
+        """Fit encoders on X_list"""
+        assert len(X_list) == len(encoders)
```
I would add an additional message in case we raise the AssertionError; otherwise it will not be very friendly.
sklearn/preprocessing/_encoders.py
Outdated

```diff
+        # map from X_trans indicies to indicies from the original X
```
indices
sklearn/preprocessing/_encoders.py
Outdated

```diff
         if n_features != len(self.categories_):
             raise ValueError(
                 "The number of features in X is different to the number of "
                 "features of the fitted data. The fitted data had {} features "
                 "and the X has {} features."
-                .format(len(self.categories_,), n_features)
-            )
+                .format(len(self.categories_,), n_features))
```
You can revert this change. Black style is fine :)
```diff
+        categories = self._check_categories(n_features)
+        drop_kwargs = self._check_drop(n_features)
+
+        encoders = [
```
The pattern

```python
encoders = [...]
self._fit_list(...)
```

will be common to each encoder. Would it make sense to have an abstract method across encoders which should be defined in all children? Then `_fit` will just be input validation (which could also be an abstract method) plus the above pattern (+ additional things depending on the encoder):

```python
def _fit(self, X):
    X = self._validate_input(X)
    self._generate_and_fit_encoder(X)
    ...
```
But be aware that this is not clear to me :)
I am glad you noticed this! One of my designs had something similar where `_fit` was an abstract method. It was even possible to get to a point where the child class only needed to define `_fit`, and `_EncoderUnion` could define `fit`, `transform`, and `fit_transform`. It works, but it felt a little too magical. (There were additional lines to overwrite the docstrings of `transform` and `fit_transform` for a specific child encoder. For example, `OneHotEncoder.transform` can output sparse matrices.)

The benefit of this PR is that `OrdinalEncoder.fit` explicitly defines the encoders and explicitly passes them up into the parent class to process.
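For illustration, here is a toy sketch of the template-method design under discussion, where the base class owns `fit` and children only supply hooks. All names (`_EncoderUnionSketch`, `_validate_input`, `_make_encoders`, the toy encoders) are hypothetical, not the PR's actual API:

```python
from abc import ABC, abstractmethod

class _EncoderUnionSketch(ABC):
    """fit() is defined once here; children only define the two hooks below."""

    def fit(self, X):
        X_list = self._validate_input(X)
        self._encoders = self._make_encoders(X_list)
        for X_col, encoder in zip(X_list, self._encoders):
            encoder.fit(X_col)
        return self

    @abstractmethod
    def _validate_input(self, X):
        """Return the input as a list of columns."""

    @abstractmethod
    def _make_encoders(self, X_list):
        """Return one single-column encoder per column."""

class _SingleToyEncoder:
    def fit(self, X_col):
        self.categories_ = sorted(set(X_col))
        return self

class ToyEncoder(_EncoderUnionSketch):
    def _validate_input(self, X):
        # split a list of rows into a list of columns
        return [list(col) for col in zip(*X)]

    def _make_encoders(self, X_list):
        return [_SingleToyEncoder() for _ in X_list]

enc = ToyEncoder().fit([["a", 1], ["b", 2], ["a", 3]])
```

This is roughly the "magical" version described in the reply above; the PR instead has each child call the helpers explicitly.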
Mostly nits about style for now (sorry)
sklearn/preprocessing/_encoders.py
Outdated

```diff
         return categories

+    def _fit_list(self, X_list, encoders):
```
I think it would help a bit to call all single encoders `single_encoder` instead of `encoder`.
Also, `fit_list` isn't super descriptive. `fit_single_encoders` maybe?
sklearn/preprocessing/_encoders.py
Outdated

```diff
         for X_col, encoder in zip(X_list, encoders):
             encoder.fit(X_col)

+        # map from X_trans indices to indices from the original X
```
```diff
-        # map from X_trans indices to indices from the original X
+        # map X_trans indices to indices from the original X
```
Also, please indicate why this is needed
sklearn/preprocessing/_encoders.py
Outdated

```diff
             Xi = X_list[i]
             diff, valid_mask = _encode_check_unknown(Xi, self.categories_[i],
                                                      return_mask=True)
+        X_trs = []
```
X_trans??
sklearn/preprocessing/_encoders.py
Outdated

```diff
+        X_trans_to_orig_idx = []
+        X_trans_idx = 0
+        for encoder in encoders:
+            n_feat_out = encoder.n_features_out_
```
n_features_out...
sklearn/preprocessing/_encoders.py
Outdated
X_trans_to_orig_idx.append((begin, end)) | ||
X_trans_idx += n_feat_out | ||
|
||
self._X_trans_to_orig_idx = X_trans_to_orig_idx |
Isn't this the reverse mapping? From (0, n_features - 1) to (0, X_trans.shape[1]) ?
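A small self-contained sketch of this index bookkeeping (variable names mimic the diff; the `DummyEncoder` class is made up for illustration): each single encoder may expand one input column into several output columns, so the list records a half-open slice of `X_trans` per original feature.

```python
class DummyEncoder:
    # Stand-in for a fitted single encoder that outputs this many columns.
    def __init__(self, n_features_out):
        self.n_features_out_ = n_features_out

# e.g. one-hot widths for three original columns
encoders = [DummyEncoder(3), DummyEncoder(1), DummyEncoder(2)]

X_trans_to_orig_idx = []
X_trans_idx = 0
for encoder in encoders:
    n_feat_out = encoder.n_features_out_
    begin, end = X_trans_idx, X_trans_idx + n_feat_out
    X_trans_to_orig_idx.append((begin, end))
    X_trans_idx += n_feat_out

print(X_trans_to_orig_idx)  # [(0, 3), (3, 4), (4, 6)]
```

As the comment above notes, the list is indexed by original feature, so entry `i` gives the slice of `X_trans` produced by original feature `i`.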
sklearn/preprocessing/_encoders.py
Outdated

```diff
-    """
-    Base class for encoders that includes the code to categorize and
+class _EncoderUnion(TransformerMixin, BaseEstimator):
+    """Base class for encoders that includes the code to categorize and
```
```diff
-    """Base class for encoders that includes the code to categorize and
+    """Base class for encoders that includes the code to encode and
```
sklearn/preprocessing/_encoders.py
Outdated

```diff
-    """
-    Base class for encoders that includes the code to categorize and
+class _EncoderUnion(TransformerMixin, BaseEstimator):
+    """Base class for encoders that includes the code to categorize and
     transform the input features.
```
... one by one.
sklearn/preprocessing/_encoders.py
Outdated

```diff
-        return X_int, X_mask
+    def _hstack(self, Xs):
+        if any(sparse.issparse(X_tran) for X_tran in Xs):
```
`X_trans`, let's be consistent please.
```diff
         if X.dtype != object and not np.all(np.sort(cats) == cats):
             raise ValueError("Unsorted categories are not "
                              "supported for numerical categories")
+        diff = _encode_check_unknown(X, cats)
```
Looking at the diff, this is protected by `if handle_unknown == 'error':` on master. (Not sure if relevant.)
`_SingleOrdinalEncoder` will always error, since it cannot handle unknowns (at the moment).
sklearn/preprocessing/_encoders.py
Outdated
self.drop_idx_ = None | ||
elif self.drop not in self.categories_: | ||
# This is an error state. Caller should check this value and | ||
# handle according. |
```diff
-            # handle according.
+            # handle accordingly.
```
Ok, so this PR adds some overhead because of how the sparse matrices are constructed. Benchmark script:

```python
from itertools import product

import numpy as np
from neurtu import Benchmark, delayed
from sklearn.preprocessing import OneHotEncoder

n_samples = [50000, 100000, 200000, 500000]
n_features = [50, 100]

def benchmark_onehot_encoder(encoder):
    rng = np.random.RandomState(42)
    for n_sample, n_feature in product(n_samples, n_features):
        clf = encoder()
        X = rng.randint(0, 50, size=(n_sample, n_feature))
        yield delayed(clf,
                      tags={
                          'n_sample': n_sample,
                          'n_feature': n_feature,
                      }).fit_transform(X)

encoder = OneHotEncoder
bench = Benchmark(wall_time=True)
df = bench(benchmark_onehot_encoder(encoder))
print(df)
```

And the results:

```
PR                       wall_time
n_sample n_feature
50000    50               0.394461
         100              0.790315
100000   50               0.824488
         100              1.671744
200000   50               1.645967
         100              3.855874
500000   50               4.927292
         100             10.548770

Master                   wall_time
n_sample n_feature
50000    50               0.287379
         100              0.626392
100000   50               0.621225
         100              1.496794
200000   50               1.356239
         100              3.314679
500000   50               4.053212
         100              8.872062
```
What about the memory issue of creating a big sparse matrix out of multiple smaller sparse matrices?
Doublish ;) (as expected). I wish there was a good way to build a sparse matrix incrementally; I have tried.

Currently this PR lets the single encoders construct a CSC, which is then hstacked and converted into a CSR. This is the fastest way, but it still needs to allocate more memory for the stacked matrix. I may need to make a compromise to get memory usage the same as on master.
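A minimal sketch of the construction described here, using toy per-column outputs (the data is made up; scipy's `hstack` allocates the full stacked matrix, which is where the extra memory goes):

```python
import numpy as np
from scipy import sparse

# Toy stand-ins for per-column encoder outputs: each single encoder
# produces a CSC matrix for its own feature (here, 2 one-hot columns each).
col_a = sparse.csc_matrix(np.array([[1, 0], [0, 1], [1, 0]]))
col_b = sparse.csc_matrix(np.array([[0, 1], [0, 1], [1, 0]]))

# hstack the per-encoder CSC blocks, then convert to CSR;
# the stacked matrix is a fresh allocation on top of the blocks.
X_trans = sparse.hstack([col_a, col_b], format="csc").tocsr()

print(X_trans.shape)  # (3, 4)
```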
I think you could probably engineer some kind of CSCBuilder, especially since here you know the maximum nnz. But I'd consider this premature optimisation if the rest of the refactor is clearly beneficial. (I've not reviewed.)
@thomasjpfan told me he gave up ;)
This PR refactors `_encoders.py` as follows:

- Adds `_SingleOneHotEncoder` and `_SingleOrdinalEncoder` to encode a single categorical feature. They implement `fit`, `transform` and `inverse_transform`; `fit` and `transform` operate on `X` of shape `(n_samples,)`.
- Adds `_EncoderUnion`, which provides utilities for child classes to call and distributes the features to the encoders:
  - `_check_X` does validation checks and returns a list of features, `X_list`.
  - `_fit_list` takes a list of encoders and sends the elements of `X_list` to their respective encoders to be fitted.
  - `_transform_list` takes `X_list` and sends its elements to their respective encoders to be transformed.
  - `inverse_transform` is defined for all encoders.
- `OneHotEncoder` and `OrdinalEncoder` do input validation and use the utilities provided by `_EncoderUnion`.

The motivation of this design is to separate the distribution of feature columns from the encoding itself. When we introduce more encoders in the future, we would only need to implement a child class of `_EncoderUnion` and a `_Single***Encoder`. This refactor should also make it easier to implement more features in our encoders, such as #14954.

CC: @NicolasHug
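To make the single-feature encoder interface described above concrete, here is a toy ordinal-style encoder operating on a single column of shape `(n_samples,)`. The class and its internals are illustrative, not the PR's actual `_SingleOrdinalEncoder`:

```python
import numpy as np

class ToySingleOrdinalEncoder:
    # Illustrative single-column encoder: fit/transform/inverse_transform
    # all operate on one categorical feature of shape (n_samples,).
    def fit(self, X):
        self.categories_ = np.unique(X)  # sorted unique categories
        return self

    def transform(self, X):
        # map each value to its index in the sorted categories
        return np.searchsorted(self.categories_, X)

    def inverse_transform(self, X_int):
        return self.categories_[X_int]

enc = ToySingleOrdinalEncoder().fit(np.array(["b", "a", "c", "a"]))
codes = enc.transform(np.array(["a", "c", "b"]))
print(codes)                          # [0 2 1]
print(enc.inverse_transform(codes))   # ['a' 'c' 'b']
```

A union-style parent class would then only need to split `X` into columns and dispatch each one to such an encoder.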