
[MRG] Encoders: make it optional to fit if categories are given (Issue #12616) #16591


Closed
wants to merge 2 commits

Conversation


@smyskoff smyskoff commented Feb 28, 2020

For both existing encoders, OneHotEncoder and OrdinalEncoder, make it optional to
fit the encoder if a categories list was passed to the constructor.

I ran into an issue: even though I pass a categories list to the class constructor, I am still forced to fit the encoder for no reason. I searched the Internet and only found issue #14310, but there is still no solution for it.

For now, I made changes that perform the fit (actually, fill in the self.categories_ array) together with content checks.

As this code is common to both encoders, OneHotEncoder and OrdinalEncoder, I made the changes in the base class, _BaseEncoder, which now has a constructor and sets the instance variables common to both encoders.

If the categories argument given to the constructor is not 'auto', it is expected to contain a list of categories. In that case, the categories_ instance variable is filled directly with the categories given to the constructor. This makes the encoders' check_is_fitted() check pass, and transform() works with no errors.

I also found an issue: if the categories are of mixed types (i.e., strings and numbers, as in [['a', 'b'], [2, 1, 3]]), both category arrays end up with object dtype, and the user is then not forced to pass the numeric category in sorted order.

I added tests to cover the new behaviour and modified those which failed the new checks (by passing a 1-d list of categories).

The current (master) implementation allows passing a 1-d categories array, but fails on the fit() attempt.

Could you please review and comment on my implementation?

For both present encoders, OneHotEncoder and OrdinalEncoder make it optional to
fit the encoder if categories list was passed to the constructor.
@jnothman
Member

Why do you feel it is necessary to support not fitting before transforming? From my perspective this just creates more different ways to interact with the estimators and more maintenance burden to ensure that every change is tested with every mode of interaction.

@smyskoff
Author

smyskoff commented Mar 2, 2020

I am just wondering why I still need to fit the encoder even though I know all the categories in advance.

If you still want to force me to fit the encoder, why do we have the option to pass categories, as described in the documentation? Fit does exactly this job: it collects the categories for the given columns.

I also found that I am not the only one asking this "why" question, and I saw that a PR was welcome. So I just did it.

@smyskoff
Author

smyskoff commented Mar 2, 2020

(I updated the issue number in the initial message, as it was wrong.)

@adrinjalali
Member

Thanks for the work you put into this, @smyskoff. We understand there are cases where calling fit may not make much sense, but the construct -> fit -> predict flow is a core part of our API (even if it is not always the best choice).

Another constraint we have is that __init__ should not set anything other than storing the given parameters, which is also violated here.

So I'm afraid we can't move forward with this.

@adrinjalali adrinjalali closed this Mar 2, 2020
@jnothman
Member

jnothman commented Mar 2, 2020 via email

@smyskoff
Author

smyskoff commented Mar 2, 2020

The purpose of the categories constructor argument has become quite unclear to me.

Why would anyone pass it?

My view may be narrow, so correct me if I'm wrong.

I may collect categories with a specialized tool like pyspark, processing a whole bunch of data that is not feasible for OHE to process on a single machine.

When I'm done with this, I may dump the collected categories and just want to use them with my model in production. I could write my own code for this, but I almost have an instrument within sklearn. So I pass my collected categories to OHE and expect them to work, yet I still have to fit.
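The scenario above can be handled under the current API by fitting on a single dummy row: with explicit categories, fit() only validates and stores them, so the fit is trivially cheap (a workaround sketch; the single column and its values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# categories collected elsewhere (e.g. with pyspark over the full data)
categories = [['a', 'b', 'c']]

enc = OneHotEncoder(categories=categories)
# workaround: fit on one tiny dummy row; fit() does not rediscover
# anything here, it just validates the given categories and stores them
enc.fit(np.array([['a']], dtype=object))

print(enc.transform(np.array([['b']], dtype=object)).toarray())  # [[0. 1. 0.]]
```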

So what is the alternative use of the categories list that I can pass to the constructor? As a constraint during fit?

I also saw the same issue raised on the forum, along with messages encouraging a PR.

@smyskoff
Author

smyskoff commented Mar 2, 2020

Also, I found that there is an issue with mixed-type categories.

The user is forced to pass numeric categories in sorted order, but this constraint does not apply when one of the categories contains non-numeric values: in that case, all the categories collected during fit() become object dtype.

My PR contains the fix, but I still invite you to review the expected behaviour.

In [1]: from sklearn.preprocessing import OneHotEncoder

In [2]: categories_allnum = [[1, 2, 3], [6, 5, 4, 3]]

In [3]: categories_mixed = [['a', 'b', 'c'], [6, 5, 4, 3]]

In [4]: X_allnum = [[1, 6], [2, 3]]

In [5]: X_mixed = [['a', 6], ['b', 3]]

In [6]: ohe_allnum = OneHotEncoder(categories=categories_allnum)

In [7]: ohe_mixed = OneHotEncoder(categories=categories_mixed)

In [8]: ohe_allnum.fit_transform(X_allnum).toarray()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-a171f9f80a6a> in <module>
----> 1 ohe_allnum.fit_transform(X_allnum).toarray()

~/.local/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in fit_transform(self, X, y)
    629                 self._categorical_features, copy=True)
    630         else:
--> 631             return self.fit(X).transform(X)
    632 
    633     def _legacy_transform(self, X):

~/.local/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in fit(self, X, y)
    491             return self
    492         else:
--> 493             self._fit(X, handle_unknown=self.handle_unknown)
    494             self.drop_idx_ = self._compute_drop_idx()
    495             return self

~/.local/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py in _fit(self, X, handle_unknown)
     95                 if Xi.dtype != object:
     96                     if not np.all(np.sort(cats) == cats):
---> 97                         raise ValueError("Unsorted categories are not "
     98                                          "supported for numerical categories")
     99                 if handle_unknown == 'error':

ValueError: Unsorted categories are not supported for numerical categories

In [9]: ohe_mixed.fit_transform(X_mixed).toarray()
Out[9]:
array([[1., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 1.]])

In [10]: [x.dtype for x in ohe_mixed.categories_]
Out[10]: [dtype('O'), dtype('O')]

@adrinjalali
Member

I think @thomasjpfan may be able to comment on this one.

@jnothman
Member

jnothman commented Mar 4, 2020 via email

@JMerrill-Fairness

JMerrill-Fairness commented Jun 14, 2021

Here's a case which shows why the current design is so utterly dreadful.

Scenario:
Users have to process a "new" data file on the basis of structures fitted to an "older" data file. These files are the HMDA records for a single US mortgage lender and the aggregate of all HMDA files for previous years.

Although the new data file is of quite manageable size (a few GB), the older data file contains tens or hundreds of gigabytes of data. Users do not have the hundreds of gigabytes of memory necessary to process the older data file.

The natural approach to this problem is to build a preprocessor based on the older data file, reconstitute it, and use the reconstituted preprocessor on the new data file. This fails.

Repro:

(On really big machine)

older_one_hot_encoder = OneHotEncoder().fit(old_dataset)

json.dump(older_one_hot_encoder.categories_, saved_category_file)

(on user machine)

older_categories = json.load(saved_category_file)

new_one_hot_encoder = OneHotEncoder(categories=older_categories)

Expect: new_one_hot_encoder.transform() to work
Observe: new_one_hot_encoder.transform fails

[ETA: I've abbreviated the JSON interface. In actual code, you need to use a list comprehension to convert the categories_ attribute before calling json.dump and also to use a list comprehension to reconstitute the results of the call to json.load into the categories argument to the new encoder.]
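The abbreviated round trip above can be written out as follows (a sketch of the same scenario; the tiny in-memory dataset stands in for the real files, and the dummy fit() call is what the current API still forces):

```python
import json
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# (on the really big machine) fit and serialize the learned categories;
# ndarrays are not JSON-serializable, hence the list comprehension
old_dataset = np.array([['a'], ['b'], ['c']], dtype=object)
older_one_hot_encoder = OneHotEncoder().fit(old_dataset)
serialized = json.dumps([c.tolist() for c in older_one_hot_encoder.categories_])

# (on the user machine) reconstitute the categories
older_categories = [list(col) for col in json.loads(serialized)]
new_one_hot_encoder = OneHotEncoder(categories=older_categories)

# the current API still requires a fit() before transform();
# a single dummy row suffices since the categories are already given
new_one_hot_encoder.fit(np.array([['a']], dtype=object))
row = new_one_hot_encoder.transform(np.array([['c']], dtype=object)).toarray()
print(row)  # [[0. 0. 1.]]
```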

@CisterMoke

> Here's a case which shows why the current design is so utterly dreadful. […]

From the _fit method inside OneHotEncoder, we have the following piece of code:

if self.categories == 'auto':
    cats = _unique(Xi)
else:
    cats = np.array(self.categories[i], dtype=Xi.dtype)

where self.categories is set during initialization. So the fit function will take into account the categories obtained from the old data and continue to work with them.
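This is easy to confirm: with explicit categories, fit() keeps the full given category list even when the fitted data does not contain every value (a small check, with illustrative values):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(categories=[['a', 'b', 'c']])
# the fitted sample lacks 'b' and 'c', yet all three categories survive
enc.fit(np.array([['a']], dtype=object))
print([c.tolist() for c in enc.categories_])  # [['a', 'b', 'c']]
```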

@jnothman
Member

jnothman commented Jul 18, 2021 via email

@adrinjalali
Member

@CisterMoke you may also want to have a look here: #8370
