
[MRG] Encoders: make it optional to fit if categories are given (Issue #12616) #16591


Closed
wants to merge 2 commits

Conversation


@smyskoff smyskoff commented Feb 28, 2020

For both existing encoders, OneHotEncoder and OrdinalEncoder, make it optional to
fit the encoder if a categories list was passed to the constructor.

I ran into an issue: even though I pass a categories list to the class constructor, I am still forced to fit the encoder for no reason. I searched the Internet and only found issue #14310, but there is still no solution for it.

For now, I made changes that perform the fit (actually, fill in the self.categories_ array) together with content checks.

As this code is common to both encoders, OneHotEncoder and OrdinalEncoder, I made the changes in the base class, _BaseEncoder, which now has a constructor and sets the instance variables common to both encoders.

If the categories argument given to the constructor is not 'auto', it is expected to contain a list of categories. In that case, the categories_ instance variable is filled directly with the categories given to the constructor. This makes the encoders' check_is_fitted() check pass, and transform() works with no errors.

I also found an issue: if the categories are of mixed types (i.e., strings and numbers, as in [['a', 'b'], [2, 1, 3]]), both category arrays end up with object dtype, and the user is then not forced to pass the numeric category in sorted order.

I added tests to cover the new behaviour and modified those which failed the new checks (by passing a 1-d list of categories).

The current (master) implementation allows passing a 1-d categories array, but fails on the fit() attempt.

Could you please review and comment on my implementation?

For both present encoders, OneHotEncoder and OrdinalEncoder make it optional to
fit the encoder if categories list was passed to the constructor.
@jnothman
Member

Why do you feel it is necessary to support not fitting before transforming? From my perspective this just creates more different ways to interact with the estimators and more maintenance burden to ensure that every change is tested with every mode of interaction.

@smyskoff
Author

smyskoff commented Mar 2, 2020

I am just wondering why I still need to fit the encoder even though I know all the categories in advance.

If you still want to force me to fit the encoder, why do we have the option to pass categories, as described in the documentation? Fit does exactly this job: it collects the categories for the given columns.

I also found that I am not the only one asking this "why" question, and I saw that a PR was welcome. So I just did it.

@smyskoff
Author

smyskoff commented Mar 2, 2020

(I updated the issue number in the initial message, as it was wrong.)

@adrinjalali
Member

Thanks for the work you put into this, @smyskoff. We understand there are cases where calling fit may not make much sense, but the construct -> fit -> predict flow is a core part of our API (even if it is not always the best choice).

Another constraint we have is that __init__ should not set anything other than storing the given parameters, which is also violated here.

So I'm afraid we can't move forward with this.

@adrinjalali adrinjalali closed this Mar 2, 2020
@jnothman
Member

jnothman commented Mar 2, 2020 via email

@smyskoff
Author

smyskoff commented Mar 2, 2020

The purpose of the categories constructor argument has become quite unclear to me.

Why would anyone pass it?

My view may be narrow, so correct me if I'm wrong.

I may collect categories with a specialized tool like pyspark, processing a whole bunch of data that is not feasible for OHE to process on a single machine.

When I'm done with this, I may dump the collected categories and just want to use them with my model in production. I could write my own code for this, but I almost have an instrument within sklearn. So I pass my collected categories to OHE and expect them to work, yet I still have to fit.
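The scenario above can be handled under the current API by fitting on a single dummy row: with explicit categories, fit() only validates and stores them, so the fit is trivially cheap (a workaround sketch; the single column and its values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# categories collected elsewhere (e.g. with pyspark over the full data)
categories = [['a', 'b', 'c']]

enc = OneHotEncoder(categories=categories)
# workaround: fit on one tiny dummy row; fit() does not rediscover
# anything here, it just validates the given categories and stores them
enc.fit(np.array([['a']], dtype=object))

print(enc.transform(np.array([['b']], dtype=object)).toarray())  # [[0. 1. 0.]]
```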

So what is the alternative use of the categories list that I can pass to the constructor? As a constraint during fit?

I also saw the same issue raised on the forum, along with messages encouraging a PR.

@smyskoff
Author

smyskoff commented Mar 2, 2020

Also, I found that there is an issue with mixed-type categories.

The user is forced to pass numeric categories in sorted order, but this constraint does not apply when one of the categories contains non-numeric values: in that case, all the categories collected during fit() become object dtype.

My PR contains the fix, but I still invite you to review the expected behaviour.

In [1]: from sklearn.preprocessing import OneHotEncoder

In [2]: categories_allnum = [[1, 2, 3], [6, 5, 4, 3]]

In [3]: categories_mixed = [['a', 'b', 'c'], [6, 5, 4, 3]]

In [4]: X_allnum = [[1, 6], [2, 3]]

In [5]: X_mixed = [['a', 6], ['b', 3]]

In [6]: ohe_allnum = OneHotEncoder(categories=categories_allnum)

In [7]: ohe_mixed = OneHotEncoder(categories=categories_mixed)

In [8]: ohe_allnum.fit_transform(X_allnum).toarray()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-a171f9f80a6a> in <module>
----> 1 ohe_allnum.fit_transform(X_allnum).toarray()

~/.local/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in fit_transform(self, X, y)
    629                 self._categorical_features, copy=True)
    630         else:
--> 631             return self.fit(X).transform(X)
    632 
    633     def _legacy_transform(self, X):

~/.local/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in fit(self, X, y)
    491             return self
    492         else:
--> 493             self._fit(X, handle_unknown=self.handle_unknown)
    494             self.drop_idx_ = self._compute_drop_idx()
    495             return self

~/.local/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in _fit(self, X, handle_unknown)
     95                 if Xi.dtype != object:
     96                     if not np.all(np.sort(cats) == cats):
---> 97                         raise ValueError("Unsorted categories are not "
     98                                          "supported for numerical categories")
     99                 if handle_unknown == 'error':

ValueError: Unsorted categories are not supported for numerical categories

In [9]: ohe_mixed.fit_transform(X_mixed).toarray()
Out[9]:
array([[1., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 1.]])

In [10]: [x.dtype for x in ohe_mixed.categories_]
Out[10]: [dtype('O'), dtype('O')]

@adrinjalali
Member

I think @thomasjpfan may be able to comment on this one.

@jnothman
Member

jnothman commented Mar 4, 2020 via email

@JMerrill-Fairness

JMerrill-Fairness commented Jun 14, 2021

Here's a case which shows why the current design is so utterly dreadful.

Scenario:
Users have to process a "new" data file on the basis of structures fitted to an "older" data file. These files are the HMDA records for a single US mortgage lender and the aggregate of all HMDA files for previous years.

Although the new data file is of quite manageable size (a few GB), the older data file contains tens or hundreds of gigabytes of data. Users do not have the hundreds of gigabytes of memory necessary to process the older data file.

The natural approach to this problem is to build a preprocessor based on the older data file, reconstitute it, and use the reconstituted preprocessor on the new data file. This fails.

Repro:

(On really big machine)

older_one_hot_encoder = OneHotEncoder().fit(old_dataset)

json.dump(older_one_hot_encoder.categories_, saved_category_file)

(on user machine)

older_categories = json.load(saved_category_file)

new_one_hot_encoder = OneHotEncoder(categories=older_categories)

Expect: new_one_hot_encoder.transform() to work
Observe: new_one_hot_encoder.transform fails

[ETA: I've abbreviated the JSON interface. In actual code, you need to use a list comprehension to convert the categories_ attribute before calling json.dump and also to use a list comprehension to reconstitute the results of the call to json.load into the categories argument to the new encoder.]
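The abbreviated round trip above can be written out as follows (a sketch of the same scenario; the tiny in-memory dataset stands in for the real files, and the dummy fit() call is what the current API still forces):

```python
import json
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# (on the really big machine) fit and serialize the learned categories;
# ndarrays are not JSON-serializable, hence the list comprehension
old_dataset = np.array([['a'], ['b'], ['c']], dtype=object)
older_one_hot_encoder = OneHotEncoder().fit(old_dataset)
serialized = json.dumps([c.tolist() for c in older_one_hot_encoder.categories_])

# (on the user machine) reconstitute the categories
older_categories = [list(col) for col in json.loads(serialized)]
new_one_hot_encoder = OneHotEncoder(categories=older_categories)

# the current API still requires a fit() before transform();
# a single dummy row suffices since the categories are already given
new_one_hot_encoder.fit(np.array([['a']], dtype=object))
row = new_one_hot_encoder.transform(np.array([['c']], dtype=object)).toarray()
print(row)  # [[0. 0. 1.]]
```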

@CisterMoke

> Here's a case which shows why the current design is so utterly dreadful. […]

From the _fit method inside OneHotEncoder, we have the following piece of code:

if self.categories == 'auto':
    cats = _unique(Xi)
else:
    cats = np.array(self.categories[i], dtype=Xi.dtype)

where self.categories is set during initialization. So the fit function will take into account the categories obtained from the old data and continue to work with them.
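This is easy to confirm: with explicit categories, fit() keeps the full given category list even when the fitted data does not contain every value (a small check, with illustrative values):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(categories=[['a', 'b', 'c']])
# the fitted sample lacks 'b' and 'c', yet all three categories survive
enc.fit(np.array([['a']], dtype=object))
print([c.tolist() for c in enc.categories_])  # [['a', 'b', 'c']]
```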

@jnothman
Member

jnothman commented Jul 18, 2021 via email

@adrinjalali
Member

@CisterMoke you may also want to have a look here: #8370
