[MRG] Encoders: make it optional to fit if categories are given (Issue #12616) #16591
For both existing encoders, OneHotEncoder and OrdinalEncoder, make fitting optional if a list of categories was passed to the constructor.
Why do you feel it is necessary to support not fitting before transforming? From my perspective this just creates more ways to interact with the estimators and more maintenance burden to ensure that every change is tested with every mode of interaction.
I am just wondering why I still need to fit the encoder even though I know all the categories in advance. If you still want to force me to fit the encoder, why do we have the option to pass categories, as described in the documentation? Fit does exactly this job: it collects the categories for the given columns. I also found that I am not the only one asking this question, and saw that a PR would be welcome. So I just did it.
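The behaviour being questioned here can be reproduced in a few lines. This is a minimal sketch, assuming a scikit-learn release without this PR's change; the category values are made up for illustration.

```python
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import OneHotEncoder

# All categories are known up front (illustrative values).
encoder = OneHotEncoder(categories=[["a", "b", "c"]])

try:
    # No fit() call has been made, so the fitted-state check fails.
    encoder.transform([["a"]])
    raised_not_fitted = False
except NotFittedError:
    raised_not_fitted = True
```

Passing categories to the constructor only stores the parameter; the fitted attribute categories_ is created by fit().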
(I updated the issue number in the initial message, as it was wrong.)
Thanks for the work you put in this @smyskoff. We understand there are cases where calling Another constraint we have is that So I'm afraid we can't move forward with this.
I think there is also a risk that with more ways to configure OHE, the
validation in fit may become more complex, so identifying the exceptional
cases where no fit is needed becomes a substantial piece of logic in
itself. Then the support for such a feature here may imply that we should
provide similar support elsewhere.
I try not to exaggerate these things too much, but I do see growing burden
on the maintainers here.
The purpose of Why would anyone pass it? My view may be narrow, so correct me if I'm wrong. I may collect categories with a specialized tool, like pyspark, processing a large amount of data that OHE cannot feasibly process on a single machine. When I'm done with this, I may dump the collected categories and simply want to use them with my model in production. I could write my own code for this, but I almost have the right instrument within sklearn. So I pass my collected categories to OHE and expect them to work, yet I still have to fit. So what is an alternative usage of the I just saw the same issues on the forum, and messages that encourage writing a PR.
Also, I found that there's an issue with mixed-type categories. The user is forced to pass numeric categories in sorted order, but this constraint is not enforced if one of the categories contains non-numeric values. In this case all the collected categories during the My PR contains the fix, but I invite you to review the expected behaviour.
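The inconsistency described above can be illustrated as follows. This is a sketch under the assumption that the installed scikit-learn still rejects unsorted numeric categories at fit time and handles mixed-type input as object dtype; the data values are invented.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Purely numeric column: an unsorted category list is rejected at fit time.
numeric_encoder = OrdinalEncoder(categories=[[2, 1, 3]])
try:
    numeric_encoder.fit(np.array([[1], [2], [3]]))
    numeric_rejected = False
except ValueError:
    numeric_rejected = True

# Mixed string/number input is handled as object dtype, so the same
# unsorted numeric list is accepted for the second column.
X_mixed = np.array([["a", 2], ["b", 1]], dtype=object)
mixed_encoder = OrdinalEncoder(categories=[["a", "b"], [2, 1, 3]]).fit(X_mixed)
numeric_column_categories = mixed_encoder.categories_[1].tolist()
```

In the mixed case the user-supplied order is kept as given, so the sortedness requirement depends on the dtype of the input rather than on the categories themselves.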
I think @thomasjpfan may be able to comment on this one.
We allow categories to be specified in case the transformer needs to be
used in a context where not all categories may be present in the data used
to fit the estimator, or where a specific order of outputs may be needed.
Re the mixed type, please open a separate issue if there is not one already.
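The two uses of the categories parameter named in this comment (categories absent from the fitting data, and a specific output order) can be sketched as below. The category names and data are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Desired column order, including a category the fitting data never contains.
categories = [["low", "medium", "high"]]
X_fit = np.array([["low"], ["high"]], dtype=object)

encoder = OneHotEncoder(categories=categories).fit(X_fit)

# "medium" was not in X_fit, but it is a declared category, so it encodes
# to the second output column instead of raising an unknown-category error.
encoded = encoder.transform(np.array([["medium"]], dtype=object)).toarray()
```

With categories left at its default, the same fit would learn only "high" and "low" (in sorted order), and transforming "medium" would fail under the default handle_unknown setting.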
Here's a case which shows why the current design is so utterly dreadful.

Scenario: Although the new datafile is of quite manageable size (a few GB), the older data file contains tens or hundreds of gigabytes of data. Users do not have the hundreds of gigabytes of memory necessary to process the older datafile. The natural approach to this problem is to build a preprocessor based on the older data file, reconstitute it, and use the reconstituted preprocessor on the new data file. This fails.

Repro:

(On really big machine)

    older_one_hot_encoder = OneHotEncoder().fit(old_dataset)
    json.dump(older_one_hot_encoder.categories_, saved_category_file)

(On user machine)

    older_categories = json.load(saved_category_file)
    new_one_hot_encoder = OneHotEncoder(categories=older_categories)

Expect: new_one_hot_encoder.transform() to work

[ETA: I've abbreviated the JSON interface. In actual code, you need to use a list comprehension to convert the categories_ attribute before calling json.dump and also to use a list comprehension to reconstitute the results of the call to json.load into the categories argument to the new encoder.]
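The conversions mentioned in the ETA note might look like the sketch below. It uses a tiny in-memory stand-in for old_dataset and a JSON string instead of a file; both are assumptions made to keep the example self-contained.

```python
import json

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Stand-in for the large older dataset (illustrative values only).
old_dataset = np.array(
    [["red", "S"], ["blue", "M"], ["green", "L"]], dtype=object
)
older_one_hot_encoder = OneHotEncoder().fit(old_dataset)

# categories_ is a list of NumPy arrays, which json cannot serialize
# directly, so each array is converted to a plain list first.
serialized = json.dumps(
    [cats.tolist() for cats in older_one_hot_encoder.categories_]
)

# On the other machine: rebuild the list-of-lists and pass it as categories.
older_categories = [list(cats) for cats in json.loads(serialized)]
new_one_hot_encoder = OneHotEncoder(categories=older_categories)
```

At this point new_one_hot_encoder carries the old categories as a parameter but is still unfitted, which is the situation the comment objects to.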
From the
where |
Yes, you would need to call fit, but not necessarily with the same data.
Best practice: use a Pipeline.
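A sketch of what this suggestion could look like in practice: the encoder receives the known categories, and fit is called on whatever smaller data is at hand. The classifier, data, and step names are illustrative assumptions, not part of the discussion.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Categories collected elsewhere (for example, from a much larger dataset).
known_categories = [["blue", "green", "red"]]

# The data used for fitting here does not contain "green" at all.
X_new = np.array([["red"], ["blue"], ["red"], ["blue"]], dtype=object)
y_new = np.array([1, 0, 1, 0])

model = Pipeline(
    [
        ("encode", OneHotEncoder(categories=known_categories)),
        ("classify", LogisticRegression()),
    ]
)
model.fit(X_new, y_new)

# The encoder still emits one column per declared category.
n_columns = model.named_steps["encode"].transform(X_new).shape[1]
prediction = model.predict(np.array([["green"]], dtype=object))
```

Because the categories are fixed by the constructor argument, the output width does not depend on which categories happen to appear in the data passed to fit.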
@CisterMoke you may also want to have a look here: #8370
I ran into an issue: even though I pass a list of categories to the class constructor, I am still forced to fit the encoder for no reason. I looked through the Internet and found only issue #14310, but there is still no solution for this.

For now, my changes fit the encoder (in practice, they fill in the self.categories_ array, with content checks).

As this code is common to both encoders, OneHotEncoder and OrdinalEncoder, I made the changes in the base class, _BaseEncoder, which now has a constructor and sets the instance variables common to both encoders.

If the categories argument given to the constructor is not 'auto', it is expected to contain a list of categories. The flow then fills the categories_ instance variable with the categories given to the constructor. This makes the encoders' check_is_fitted() check pass, and transform() works with no errors.

I also found an issue: if the categories were of mixed types (for example, strings and numbers, as in [['a', 'b'], [2, 1, 3]]), both categories are supposed to be of object type, and the user was not forced to pass the numeric category in sorted order.

I added and modified tests to cover what I added, and modified those that fail the new checks (passing a 1-d list of categories). The current (master) implementation allows passing a 1-d categories array, but fails on the fit() attempt.

Could you please review and comment on my implementation?
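The flow described above can be illustrated with a toy, pure-Python encoder. ToyOrdinalEncoder is a hypothetical stand-in written for this sketch; it is not the PR's code and not scikit-learn's _BaseEncoder.

```python
class ToyOrdinalEncoder:
    """Hypothetical stand-in for illustration, not scikit-learn's encoder."""

    def __init__(self, categories="auto"):
        self.categories = categories
        if categories != "auto":
            # Proposed behaviour: explicit categories make the encoder
            # usable immediately by filling the fitted attribute here.
            self.categories_ = [list(cats) for cats in categories]

    def fit(self, X):
        if self.categories == "auto":
            # Learn the sorted unique values of each column.
            n_features = len(X[0])
            self.categories_ = [
                sorted({row[j] for row in X}) for j in range(n_features)
            ]
        return self

    def transform(self, X):
        if not hasattr(self, "categories_"):
            raise RuntimeError("This encoder is not fitted yet.")
        lookups = [
            {cat: i for i, cat in enumerate(cats)} for cats in self.categories_
        ]
        return [[lookups[j][v] for j, v in enumerate(row)] for row in X]


# Works without fit() because categories were given to the constructor.
codes = ToyOrdinalEncoder(categories=[["a", "b"], [1, 2, 3]]).transform([["b", 3]])
```

One design consequence worth noting: scikit-learn's developer guidelines ask that __init__ only store its parameters and leave validation and derived attributes to fit, so setting categories_ in the constructor departs from that convention.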