
[WIP] CLN Encoder refactor #14972


Closed

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Sep 13, 2019

This PR refactors _encoders.py as follows:

  1. Introduces _SingleOneHotEncoder and _SingleOrdinalEncoder to encode a single categorical feature. They implement fit, transform, and inverse_transform; fit and transform operate on X of shape (n_samples,).
  2. Introduces _EncoderUnion, which provides utilities for child classes to call and distributes the features to the encoders:
  • _check_X does validation checks and returns a list of features, X_list
  • _fit_list takes a list of encoders and sends the elements of X_list to their respective encoders to be fitted.
  • _transform_list takes X_list and sends its elements to their respective encoders to be transformed.
  • inverse_transform is defined for all encoders.
  3. OneHotEncoder and OrdinalEncoder do input validation and use the utilities provided by _EncoderUnion.

The motivation for this design is to separate the distribution of feature columns from the encoding itself. When we introduce more encoders in the future, we would only need to implement a child class of _EncoderUnion and a _Single***Encoder. This refactor should also make it easier to add more features to our encoders, such as #14954.
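Roughly, the division of labor looks like this (a minimal sketch with simplified method bodies, not the actual implementation; only the ordinal case is shown):

import numpy as np

class _SingleOrdinalEncoder:
    """Encode a single categorical feature; X has shape (n_samples,)."""

    def fit(self, X):
        # the sorted unique values of this one column are its categories
        self.categories_ = np.unique(X)
        return self

    def transform(self, X):
        # map each value to the index of its category
        return np.searchsorted(self.categories_, X)

    def inverse_transform(self, X_int):
        return self.categories_[X_int]

class _EncoderUnion:
    """Distribute feature columns to the single-feature encoders."""

    def _fit_list(self, X_list, encoders):
        for X_col, encoder in zip(X_list, encoders):
            encoder.fit(X_col)
        self._encoders = encoders

    def _transform_list(self, X_list):
        return [encoder.transform(X_col)
                for X_col, encoder in zip(X_list, self._encoders)]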

CC: @NicolasHug

@jnothman
Member

This refactor should also make it easier to implement more features into our encoders such as #14954.

Does it mean that the user would get one warning for each column??

@thomasjpfan
Member Author

Does it mean that the user would get one warning for each column??

This refactor was done in a way that has the caller do the warning handling. The pattern is to set an attribute on _SingleOneHotEncoder that the caller inspects to decide whether there is an exception to throw. For example, in _SingleOneHotEncoder, drop_idx_ is set to 'missing' if drop does not appear in categories_. This way the caller, OneHotEncoder, can check the attribute and raise a single error containing all the categories that failed.
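A minimal sketch of that pattern (the 'missing' sentinel and the helper below are illustrative, not necessarily the PR's exact code):

import numpy as np

class _SingleOneHotEncoder:
    def __init__(self, drop=None):
        self.drop = drop

    def fit(self, X):
        self.categories_ = np.unique(X)
        if self.drop is None:
            self.drop_idx_ = None
        elif self.drop not in self.categories_:
            # error state: the caller checks this value instead of us
            # raising here, so errors can be aggregated across columns
            self.drop_idx_ = 'missing'
        else:
            self.drop_idx_ = int(np.searchsorted(self.categories_, self.drop))
        return self

def _raise_missing_drops(single_encoders):
    # caller side (roughly what OneHotEncoder.fit would do)
    missing = [i for i, enc in enumerate(single_encoders)
               if enc.drop_idx_ == 'missing']
    if missing:
        raise ValueError("The categories to drop were not found in the "
                         "fitted categories of columns {}".format(missing))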

@thomasjpfan thomasjpfan changed the title from [MRG] CLN Encoder refactor v2 to [WIP] CLN Encoder refactor v2 Sep 13, 2019
@thomasjpfan thomasjpfan changed the title from [WIP] CLN Encoder refactor v2 to [MRG] CLN Encoder refactor Sep 13, 2019
@thomasjpfan
Member Author

The public interface is exactly the same as we have on master.

On a side note, I considered having _SingleOneHotEncoder raise an exception so the caller could gather the exceptions. But without using custom exceptions, it would be hard for the caller to distinguish between them, and matching on the exception's message string seems a little too hacky. So this PR implements a procedure to "set an attribute if the _Single***Encoder is in a bad state". It is somewhat similar to fit_status_ in our SVMs.

@thomasjpfan
Member Author

I am hoping this makes it much easier to implement encoders, or at least to try out different encoders. Most of OneHotEncoder and OrdinalEncoder is input validation; all the heavy work is done by _Single***Encoder, which has the restricted domain of a single categorical feature.

@glemaitre glemaitre left a comment
Member

Do we get any regression in performance?

@@ -52,103 +61,112 @@ def _check_X(self, X):
         # to keep the dtype information to be used in the encoder.
         needs_validation = True

-        n_samples, n_features = X.shape
+        n_features = X.shape[1]
Member

You can remove this variable

@@ -52,103 +61,112 @@ def _check_X(self, X):
         # to keep the dtype information to be used in the encoder.
         needs_validation = True

-        n_samples, n_features = X.shape
+        n_features = X.shape[1]
         X_columns = []

         for i in range(n_features):
Member

n_features => X.shape[1]

-        X_mask = np.ones((n_samples, n_features), dtype=np.bool)
+    def _fit_list(self, X_list, encoders):
+        """Fit encoders on X_list"""
+        assert len(X_list) == len(encoders)
Member

I would add a message in case we raise the AssertionError; it will not be very friendly otherwise.


# map from X_trans indicies to indicies from the original X
Member

indices

         if n_features != len(self.categories_):
             raise ValueError(
                 "The number of features in X is different to the number of "
                 "features of the fitted data. The fitted data had {} features "
                 "and the X has {} features."
-                .format(len(self.categories_,), n_features)
-            )
+                .format(len(self.categories_,), n_features))
Member

you can revert this change. Black style is fine :)

        categories = self._check_categories(n_features)
        drop_kwargs = self._check_drop(n_features)

        encoders = [
Member

the pattern

encoders = [...]
self._fit_list(...)

will be common to each encoder. Would it make sense to have an abstract method across encoders which should be defined in all children?

Then _fit will just be the input validation (which could also be an abstract method) plus the above pattern (+ additional things depending on the encoder):

def _fit(self, X):
    X = self._validate_input(X)
    self._generate_and_fit_encoder(X)
    ...

Member

But be aware that this is not clear to me :)

Member Author

I am glad you noticed this! One of my designs had something similar where _fit was an abstract method. It was even possible to get to a point where the child class only needed to define _fit and _EncoderUnion could define fit, transform, and fit_transform. It works, but it felt a little too magical. (There were additional lines to overwrite the docstrings of transform and fit_transform for a specific child encoder. For example, OneHotEncoder.transform can output sparse matrices.)

The benefit of this PR is that OrdinalEncoder.fit explicitly defines the encoders and explicitly passes them up to the parent class to process.
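For illustration, that more "magical" design would have looked roughly like this (a hypothetical sketch, not code from any branch):

from abc import ABC, abstractmethod

class _EncoderUnion(ABC):

    @abstractmethod
    def _fit(self, X):
        """Validate X and fit the single encoders (sets self._encoders)."""

    def fit(self, X):
        self._fit(X)
        return self

    def transform(self, X):
        # common path: split X into columns, transform each, and hstack
        X_list = self._check_X(X)
        return self._hstack([encoder.transform(X_col)
                             for X_col, encoder
                             in zip(X_list, self._encoders)])

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)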

@NicolasHug NicolasHug left a comment
Member

Mostly nits about style for now (sorry)


        return categories

    def _fit_list(self, X_list, encoders):
Member

I think it would help a bit to call all single encoders single_encoder instead of encoder.

Member

Also fit_list isn't super descriptive. fit_single_encoders maybe?

        for X_col, encoder in zip(X_list, encoders):
            encoder.fit(X_col)

# map from X_trans indices to indices from the original X
Member

Suggested change
# map from X_trans indices to indices from the original X
# map X_trans indices to indices from the original X

Member

Also, please indicate why this is needed

        Xi = X_list[i]
        diff, valid_mask = _encode_check_unknown(Xi, self.categories_[i],
                                                 return_mask=True)
        X_trs = []
Member

X_trans??

        X_trans_to_orig_idx = []
        X_trans_idx = 0
        for encoder in encoders:
            n_feat_out = encoder.n_features_out_
Member

n_features_out...

            X_trans_to_orig_idx.append((begin, end))
            X_trans_idx += n_feat_out

        self._X_trans_to_orig_idx = X_trans_to_orig_idx
Member

Isn't this the reverse mapping? From (0, n_features - 1) to (0, X_trans.shape[1]) ?
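For reference, if the first single encoder outputs 3 columns and the second outputs 2, the attribute ends up as [(0, 3), (3, 5)], i.e. it maps each original feature to its column slice in X_trans:

# hypothetical illustration of the loop above
X_trans_to_orig_idx = []
X_trans_idx = 0
for n_feat_out in [3, 2]:   # n_features_out_ of each single encoder
    begin, end = X_trans_idx, X_trans_idx + n_feat_out
    X_trans_to_orig_idx.append((begin, end))
    X_trans_idx += n_feat_out
print(X_trans_to_orig_idx)  # [(0, 3), (3, 5)]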

"""
Base class for encoders that includes the code to categorize and
class _EncoderUnion(TransformerMixin, BaseEstimator):
"""Base class for encoders that includes the code to categorize and
Member

Suggested change
"""Base class for encoders that includes the code to categorize and
"""Base class for encoders that includes the code to encode and

"""
Base class for encoders that includes the code to categorize and
class _EncoderUnion(TransformerMixin, BaseEstimator):
"""Base class for encoders that includes the code to categorize and
transform the input features.
Member

... one by one.


-        return X_int, X_mask
+    def _hstack(self, Xs):
+        if any(sparse.issparse(X_tran) for X_tran in Xs):
Member

X_trans let's be consistent please

        if X.dtype != object and not np.all(np.sort(cats) == cats):
            raise ValueError("Unsorted categories are not "
                             "supported for numerical categories")
        diff = _encode_check_unknown(X, cats)
Member

Looking at the diff this is protected by if handle_unknown == 'error': in master.

(Not sure if relevant)

Member Author

_SingleOrdinalEncoder will always error since it cannot handle unknown categories (at the moment).
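For context, on master an unknown category makes OrdinalEncoder raise at transform time, e.g.:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder().fit(np.array([['a'], ['b']]))
enc.transform(np.array([['c']]))  # ValueError: found unknown categories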

            self.drop_idx_ = None
        elif self.drop not in self.categories_:
            # This is an error state. Caller should check this value and
            # handle according.
Member

Suggested change
# handle according.
# handle accordingly.

@thomasjpfan thomasjpfan changed the title from [MRG] CLN Encoder refactor to [WIP] CLN Encoder refactor Sep 19, 2019
@thomasjpfan
Member Author

Ok so this PR adds some overhead because of how the sparse matrices are hstacked and converted to CSR on output. Here is a benchmark of OneHotEncoder:

from itertools import product
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from neurtu import Benchmark, delayed

n_samples = [50000, 100000, 200000, 500000]
n_features = [50, 100]

def benchmark_onehot_encoder(encoder):
    rng = np.random.RandomState(42)

    for n_sample, n_feature in product(n_samples, n_features):
        clf = encoder()

        X = rng.randint(0, 50, size=(n_sample, n_feature))
        yield delayed(clf,
                      tags={
                          'n_sample': n_sample,
                          'n_feature': n_feature,
                      }).fit_transform(X)


encoder = OneHotEncoder
bench = Benchmark(wall_time=True)
df = bench(benchmark_onehot_encoder(encoder))
print(df)

And the results:

PR

                    wall_time
n_sample n_feature           
50000    50          0.394461
         100         0.790315
100000   50          0.824488
         100         1.671744
200000   50          1.645967
         100         3.855874
500000   50          4.927292
         100        10.548770

Master

                    wall_time
n_sample n_feature           
50000    50          0.287379
         100         0.626392
100000   50          0.621225
         100         1.496794
200000   50          1.356239
         100         3.314679
500000   50          4.053212
         100         8.872062

@NicolasHug
Member

what about the memory issue of creating a big sparse matrix out of multiple smaller sparse matrices?

@thomasjpfan
Member Author

what about the memory issue of creating a big sparse matrix out of multiple smaller sparse matrices?

Doublish. ;) (As expected)

I wish there was a good way to build a sparse matrix incrementally. I have tried lil and dok but they are not that great, and there is another copy made when converting to CSR.

Currently this PR lets the single encoders construct a CSC matrix, which is then hstacked and converted into a CSR. This is the fastest way, but it still needs to allocate more memory for the stacked matrix.

I may need to make a compromise to get memory usage the same as on master. The _EncoderUnion would then need to know something about encoding, and not just be a class that hstacks multiple encodings.
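A minimal sketch of the hstack-then-convert approach described above, using scipy directly (illustrative sizes, not the PR's code):

import numpy as np
from scipy import sparse

rng = np.random.RandomState(42)
# one CSC matrix per feature, as each single encoder would produce
per_feature = [sparse.random(100000, 50, density=0.02, format='csc',
                             random_state=rng) for _ in range(10)]
# hstack allocates the stacked matrix, and tocsr() copies it again:
# roughly the "doublish" peak memory compared to master
X_trans = sparse.hstack(per_feature).tocsr()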

@jnothman
Member

jnothman commented Oct 2, 2019 via email

@amueller amueller closed this Jan 6, 2020
@amueller
Member

amueller commented Jan 6, 2020

@thomasjpfan told me he gave up ;)
