
[WIP] CLN Encoder refactor #14972


Closed

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Sep 13, 2019

This PR refactors _encoders.py as follows:

  1. Introduces _SingleOneHotEncoder and _SingleOrdinalEncoder to encode a single categorical feature. They implement fit, transform, and inverse_transform; fit and transform operate on X of shape (n_samples,).
  2. Introduces _EncoderUnion, which provides utilities for child classes to call and distributes the features to the encoders:
  • _check_X does validation checks and returns a list of features, X_list
  • _fit_list takes a list of encoders and sends the elements of X_list to their respective encoders to be fitted.
  • _transform_list takes X_list and sends its elements to their respective encoders to be transformed.
  • inverse_transform is defined for all encoders.
  3. OneHotEncoder and OrdinalEncoder do input validation and use the utilities provided by _EncoderUnion.

The motivation for this design is to separate the distribution of feature columns from the encoding itself. When we introduce more encoders in the future, we would only need to implement a child class of _EncoderUnion and a _Single***Encoder. This refactor should also make it easier to add more features to our encoders, such as #14954.
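Roughly, the division of labor looks like this (a minimal sketch with simplified method bodies, not the actual implementation; only the ordinal case is shown):

import numpy as np

class _SingleOrdinalEncoder:
    """Encode a single categorical feature; X has shape (n_samples,)."""

    def fit(self, X):
        # the sorted unique values of this one column are its categories
        self.categories_ = np.unique(X)
        return self

    def transform(self, X):
        # map each value to the index of its category
        return np.searchsorted(self.categories_, X)

    def inverse_transform(self, X_int):
        return self.categories_[X_int]

class _EncoderUnion:
    """Distribute feature columns to the single-feature encoders."""

    def _fit_list(self, X_list, encoders):
        for X_col, encoder in zip(X_list, encoders):
            encoder.fit(X_col)
        self._encoders = encoders

    def _transform_list(self, X_list):
        return [encoder.transform(X_col)
                for X_col, encoder in zip(X_list, self._encoders)]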

CC: @NicolasHug

@jnothman
Member

This refactor should also make it easier to implement more features into our encoders such as #14954.

Does it mean that the user would get one warning for each column??

@thomasjpfan
Member Author

Does it mean that the user would get one warning for each column??

This refactor was done in a way that has the caller do the warning handling. The pattern is to set an attribute on _SingleOneHotEncoder that the caller inspects to decide whether there is an exception to throw. For example, in _SingleOneHotEncoder, drop_idx_ is set to 'missing' if drop does not appear in categories_. This way the caller, OneHotEncoder, can check the attribute and raise a single error containing all the categories that failed.
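A minimal sketch of that pattern (the 'missing' sentinel and the helper below are illustrative, not necessarily the PR's exact code):

import numpy as np

class _SingleOneHotEncoder:
    def __init__(self, drop=None):
        self.drop = drop

    def fit(self, X):
        self.categories_ = np.unique(X)
        if self.drop is None:
            self.drop_idx_ = None
        elif self.drop not in self.categories_:
            # error state: the caller checks this value instead of us
            # raising here, so errors can be aggregated across columns
            self.drop_idx_ = 'missing'
        else:
            self.drop_idx_ = int(np.searchsorted(self.categories_, self.drop))
        return self

def _raise_missing_drops(single_encoders):
    # caller side (roughly what OneHotEncoder.fit would do)
    missing = [i for i, enc in enumerate(single_encoders)
               if enc.drop_idx_ == 'missing']
    if missing:
        raise ValueError("The categories to drop were not found in the "
                         "fitted categories of columns {}".format(missing))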

@thomasjpfan thomasjpfan changed the title from [MRG] CLN Encoder refactor v2 to [WIP] CLN Encoder refactor v2 Sep 13, 2019
@thomasjpfan thomasjpfan changed the title from [WIP] CLN Encoder refactor v2 to [MRG] CLN Encoder refactor Sep 13, 2019
@thomasjpfan
Member Author

The public interface is exactly the same as we have on master.

On a side note, I considered having _SingleOneHotEncoder raise an exception so the caller could gather the exceptions. But without using custom exceptions, it would be hard for the caller to distinguish between them, and matching on the exception's message string seems a little too hacky. So this PR implements a procedure to "set an attribute if the _Single***Encoder is in a bad state". It is somewhat similar to fit_status_ in our SVMs.

@thomasjpfan
Member Author

I am hoping this makes it much easier to implement encoders, or at least to try out different encoders. Most of OneHotEncoder and OrdinalEncoder is input validation; all the heavy work is done by _Single***Encoder, which has the restricted domain of a single categorical feature.

@glemaitre glemaitre left a comment
Member

Do we get any regression in performance?

@@ -52,103 +61,112 @@ def _check_X(self, X):
         # to keep the dtype information to be used in the encoder.
         needs_validation = True

-        n_samples, n_features = X.shape
+        n_features = X.shape[1]
Member

You can remove this variable

@@ -52,103 +61,112 @@ def _check_X(self, X):
         # to keep the dtype information to be used in the encoder.
         needs_validation = True

-        n_samples, n_features = X.shape
+        n_features = X.shape[1]
         X_columns = []

         for i in range(n_features):
Member

n_features => X.shape[1]

-        X_mask = np.ones((n_samples, n_features), dtype=np.bool)
+    def _fit_list(self, X_list, encoders):
+        """Fit encoders on X_list"""
+        assert len(X_list) == len(encoders)
Member

I would add a message in case we raise the AssertionError; it will not be very friendly otherwise.


# map from X_trans indicies to indicies from the original X
Member

indices

         if n_features != len(self.categories_):
             raise ValueError(
                 "The number of features in X is different to the number of "
                 "features of the fitted data. The fitted data had {} features "
                 "and the X has {} features."
-                .format(len(self.categories_,), n_features)
-            )
+                .format(len(self.categories_,), n_features))
Member

you can revert this change. Black style is fine :)

        categories = self._check_categories(n_features)
        drop_kwargs = self._check_drop(n_features)

        encoders = [
Member

the pattern

encoders = [...]
self._fit_list(...)

will be common to each encoder. Would it make sense to have an abstract method across encoders which should be defined in all children?

Then _fit will just be the input validation (which could also be an abstract method) plus the above pattern (+ additional things depending on the encoder):

def _fit(self, X):
    X = self._validate_input(X)
    self._generate_and_fit_encoder(X)
    ...

Member

But be aware that this is not clear to me :)

Member Author

I am glad you noticed this! One of my designs had something similar where _fit was an abstract method. It was even possible to get to a point where the child class only needed to define _fit and _EncoderUnion could define fit, transform, and fit_transform. It works, but it felt a little too magical. (There were additional lines to overwrite the docstrings of transform and fit_transform for a specific child encoder. For example, OneHotEncoder.transform can output sparse matrices.)

The benefit of this PR is that OrdinalEncoder.fit explicitly defines the encoders and explicitly passes them up to the parent class to process.
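For illustration, that more "magical" design would have looked roughly like this (a hypothetical sketch, not code from any branch):

from abc import ABC, abstractmethod

class _EncoderUnion(ABC):

    @abstractmethod
    def _fit(self, X):
        """Validate X and fit the single encoders (sets self._encoders)."""

    def fit(self, X):
        self._fit(X)
        return self

    def transform(self, X):
        # common path: split X into columns, transform each, and hstack
        X_list = self._check_X(X)
        return self._hstack([encoder.transform(X_col)
                             for X_col, encoder
                             in zip(X_list, self._encoders)])

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)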

@NicolasHug NicolasHug left a comment
Member

Mostly nits about style for now (sorry)


        return categories

    def _fit_list(self, X_list, encoders):
Member

I think it would help a bit to call all single encoders single_encoder instead of encoder.

Member

Also fit_list isn't super descriptive. fit_single_encoders maybe?

        for X_col, encoder in zip(X_list, encoders):
            encoder.fit(X_col)

# map from X_trans indices to indices from the original X
Member

Suggested change
# map from X_trans indices to indices from the original X
# map X_trans indices to indices from the original X

Member

Also, please indicate why this is needed

        Xi = X_list[i]
        diff, valid_mask = _encode_check_unknown(Xi, self.categories_[i],
                                                 return_mask=True)
        X_trs = []
Member

X_trans??

        X_trans_to_orig_idx = []
        X_trans_idx = 0
        for encoder in encoders:
            n_feat_out = encoder.n_features_out_
Member

n_features_out...

            X_trans_to_orig_idx.append((begin, end))
            X_trans_idx += n_feat_out

        self._X_trans_to_orig_idx = X_trans_to_orig_idx
Member

Isn't this the reverse mapping? From (0, n_features - 1) to (0, X_trans.shape[1]) ?
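For reference, if the first single encoder outputs 3 columns and the second outputs 2, the attribute ends up as [(0, 3), (3, 5)], i.e. it maps each original feature to its column slice in X_trans:

# hypothetical illustration of the loop above
X_trans_to_orig_idx = []
X_trans_idx = 0
for n_feat_out in [3, 2]:   # n_features_out_ of each single encoder
    begin, end = X_trans_idx, X_trans_idx + n_feat_out
    X_trans_to_orig_idx.append((begin, end))
    X_trans_idx += n_feat_out
print(X_trans_to_orig_idx)  # [(0, 3), (3, 5)]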

"""
Base class for encoders that includes the code to categorize and
class _EncoderUnion(TransformerMixin, BaseEstimator):
"""Base class for encoders that includes the code to categorize and
Member

Suggested change
"""Base class for encoders that includes the code to categorize and
"""Base class for encoders that includes the code to encode and

"""
Base class for encoders that includes the code to categorize and
class _EncoderUnion(TransformerMixin, BaseEstimator):
"""Base class for encoders that includes the code to categorize and
transform the input features.
Member

... one by one.


-        return X_int, X_mask
+    def _hstack(self, Xs):
+        if any(sparse.issparse(X_tran) for X_tran in Xs):
Member

X_trans let's be consistent please

        if X.dtype != object and not np.all(np.sort(cats) == cats):
            raise ValueError("Unsorted categories are not "
                             "supported for numerical categories")
        diff = _encode_check_unknown(X, cats)
Member

Looking at the diff this is protected by if handle_unknown == 'error': in master.

(Not sure if relevant)

Member Author

_SingleOrdinalEncoder will always error since it cannot handle unknown categories (at the moment).
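For context, on master an unknown category makes OrdinalEncoder raise at transform time, e.g.:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder().fit(np.array([['a'], ['b']]))
enc.transform(np.array([['c']]))  # ValueError: found unknown categories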

            self.drop_idx_ = None
        elif self.drop not in self.categories_:
            # This is an error state. Caller should check this value and
            # handle according.
Member

Suggested change
# handle according.
# handle accordingly.

@thomasjpfan thomasjpfan changed the title from [MRG] CLN Encoder refactor to [WIP] CLN Encoder refactor Sep 19, 2019
@thomasjpfan
Member Author

Ok so this PR adds some overhead because of how the sparse matrices are hstacked and converted to CSR on output. Here is a benchmark of OneHotEncoder:

from itertools import product
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from neurtu import Benchmark, delayed

n_samples = [50000, 100000, 200000, 500000]
n_features = [50, 100]

def benchmark_onehot_encoder(encoder):
    rng = np.random.RandomState(42)

    for n_sample, n_feature in product(n_samples, n_features):
        clf = encoder()

        X = rng.randint(0, 50, size=(n_sample, n_feature))
        yield delayed(clf,
                      tags={
                          'n_sample': n_sample,
                          'n_feature': n_feature,
                      }).fit_transform(X)


encoder = OneHotEncoder
bench = Benchmark(wall_time=True)
df = bench(benchmark_onehot_encoder(encoder))
print(df)

And the results:

PR

                    wall_time
n_sample n_feature           
50000    50          0.394461
         100         0.790315
100000   50          0.824488
         100         1.671744
200000   50          1.645967
         100         3.855874
500000   50          4.927292
         100        10.548770

Master

                    wall_time
n_sample n_feature           
50000    50          0.287379
         100         0.626392
100000   50          0.621225
         100         1.496794
200000   50          1.356239
         100         3.314679
500000   50          4.053212
         100         8.872062

@NicolasHug
Member

what about the memory issue of creating a big sparse matrix out of multiple smaller sparse matrices?

@thomasjpfan
Member Author

what about the memory issue of creating a big sparse matrix out of multiple smaller sparse matrices?

Doublish. ;) (As expected)

I wish there was a good way to build a sparse matrix incrementally. I have tried lil and dok but they are not that great, and there is another copy made when converting to CSR.

Currently this PR lets the single encoders construct a CSC matrix, which is then hstacked and converted into a CSR. This is the fastest way, but it still needs to allocate more memory for the stacked matrix.

I may need to make a compromise to get memory usage the same as on master. The _EncoderUnion would then need to know something about encoding, and not just be a class that hstacks multiple encodings.
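A minimal sketch of the hstack-then-convert approach described above, using scipy directly (illustrative sizes, not the PR's code):

import numpy as np
from scipy import sparse

rng = np.random.RandomState(42)
# one CSC matrix per feature, as each single encoder would produce
per_feature = [sparse.random(100000, 50, density=0.02, format='csc',
                             random_state=rng) for _ in range(10)]
# hstack allocates the stacked matrix, and tocsr() copies it again:
# roughly the "doublish" peak memory compared to master
X_trans = sparse.hstack(per_feature).tocsr()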

@jnothman
Member

jnothman commented Oct 2, 2019 via email

@amueller amueller closed this Jan 6, 2020
@amueller
Member

amueller commented Jan 6, 2020

@thomasjpfan told me he gave up ;)
