Add native categorical support for linear models via coordinate descent solver #18893

Open
lorentzenchr opened this issue Nov 21, 2020 · 3 comments

@lorentzenchr
Member

lorentzenchr commented Nov 21, 2020

Describe the workflow you want to enable

I'd like to propose native categorical support for linear models, similar to #18394, i.e. ordinal-encoded columns can be specified as "categorical". I see 3 main benefits:

  • Possibly better user experience because one-hot encoding becomes unnecessary (though ordinal encoding is still required as long as pandas.Categorical is unsupported).
  • Possible speed-up of fitting, as the orthogonal design of one-hot-encoded categoricals could be exploited.
  • Possibly smaller memory footprint, in particular in combination with dense numerical features.
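
To make the memory argument concrete, here is a minimal sketch comparing one ordinal-encoded column with its dense one-hot encoding. The sizes (`n_samples`, `n_levels`) and dtypes are assumptions chosen for illustration, not numbers from this issue; a sparse one-hot matrix would be smaller than the dense one, but cannot be stacked with dense numerical features without converting one side.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_levels = 10_000, 50  # illustrative sizes (assumption)

# Ordinal-encoded categorical column: one small integer per sample.
codes = rng.integers(0, n_levels, size=n_samples).astype(np.int32)

# Dense one-hot encoding of the same column: one float per sample and level.
one_hot = np.zeros((n_samples, n_levels), dtype=np.float64)
one_hot[np.arange(n_samples), codes] = 1.0

codes_bytes = codes.nbytes      # n_samples * 4 bytes
one_hot_bytes = one_hot.nbytes  # n_samples * n_levels * 8 bytes
print(f"ordinal codes: {codes_bytes} bytes, dense one-hot: {one_hot_bytes} bytes")
```

With these assumed dtypes the dense one-hot block is `2 * n_levels` times larger than the code column.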

Describe your proposed solution

Add a new parameter categorical_features indicating which columns to treat as categoricals (ordinal encoded) as in #18394.

Then, add this functionality via a coordinate descent (edit: and/or newton-cholesky) solver:
Internally, treat categoricals as if they were one-hot-encoded (as is done for the multiclass targets in HGBT, cf. code here) and exploit the following structure:

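For a feature (sub-)matrix X of a single one-hot-encoded feature and a diagonal weight matrix W, it holds that:
X.T @ W @ X = diagonal,
because each row of X has exactly one nonzero entry, so two different columns are never nonzero in the same row.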
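
A quick numerical check of this identity, as a sketch with NumPy on random data (the variable names and sizes are assumptions for illustration). It also shows that the diagonal can be computed directly from the ordinal codes with `np.bincount`, without ever materializing the one-hot matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_levels = 200, 5  # illustrative sizes (assumption)

codes = rng.integers(0, n_levels, size=n_samples)  # ordinal-encoded feature
w = rng.uniform(0.5, 2.0, size=n_samples)          # diagonal of W

# Explicit one-hot matrix, built only to verify the identity.
X = np.zeros((n_samples, n_levels))
X[np.arange(n_samples), codes] = 1.0

gram = X.T @ (w[:, None] * X)  # X.T @ W @ X without forming W as a matrix

# The same diagonal from the codes alone: sum of weights per level.
diag_from_codes = np.bincount(codes, weights=w, minlength=n_levels)

off_diagonal = gram - np.diag(np.diag(gram))
print(np.allclose(off_diagonal, 0.0), np.allclose(np.diag(gram), diag_from_codes))
```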

Thus, coordinate descent could loop in parallel over all levels/categories of this feature X, i.e. a parallelized block update.
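
A minimal sketch of such a block update, assuming a weighted squared error with elastic-net style penalty, 0.5 * sum_i w_i * (r_i - x_i @ beta)^2 + l1 * ||beta||_1 + 0.5 * l2 * ||beta||^2, where r is the partial residual that excludes this feature's contribution. The function names `soft_threshold` and `block_update` and the penalty parametrization are hypothetical and not taken from scikit-learn. "In parallel" is only illustrated here by vectorizing over levels; because the Gram block is diagonal, the result equals what sequential coordinate descent over the one-hot columns would give.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by L1-penalized coordinate descent."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def block_update(codes, w, partial_residual, n_levels, l1=0.0, l2=0.0):
    """Update the coefficients of all levels of one categorical feature at once.

    codes: ordinal codes in [0, n_levels), one per sample.
    w: sample weights (the diagonal of W).
    partial_residual: target minus the contribution of all other features.
    """
    # Per-level weighted sums replace X.T @ W @ r and diag(X.T @ W @ X).
    g = np.bincount(codes, weights=w * partial_residual, minlength=n_levels)
    h = np.bincount(codes, weights=w, minlength=n_levels)
    denom = h + l2
    coef = np.zeros(n_levels)
    # Levels with no samples and no L2 penalty keep a zero coefficient.
    np.divide(soft_threshold(g, l1), denom, out=coef, where=denom > 0)
    return coef

rng = np.random.default_rng(1)
codes = rng.integers(0, 6, size=300)
w = rng.uniform(0.5, 2.0, size=300)
r = rng.normal(size=300)
coef = block_update(codes, w, r, n_levels=6, l1=0.3, l2=0.1)
```

With `l1=0` and `l2=0` each coefficient is simply the weighted mean of the partial residual within its level.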

Estimators

If only the existing coordinate descent solver is modified, then only squared error based estimators, i.e. ElasticNet and Lasso, would benefit.
If a new or extended coordinate descent solver is OK, then several GLMs would also have native categorical support, i.e. LogisticRegression, PoissonRegressor, TweedieRegressor, etc. See also #16637.

@lorentzenchr
Member Author

@agramfort @rth You might be interested.

@agramfort
Member

agramfort commented Nov 23, 2020 via email

@lorentzenchr
Member Author

Meanwhile, https://github.com/Quantco/glum uses this trick with https://github.com/Quantco/tabmat.
