IterativeImputer shouldn't just use l2 loss by default #13286
@GaelVaroquaux points out that iterative imputation with a regularised least-squares model is more-or-less the same as using NMF for imputation. We should instead use RandomForestRegressor as the default regressor in IterativeImputer, at least if sample_posterior=False (or we can implement predict(return_std=True) on RandomForestRegressor!).
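For context, here is a minimal sketch of the predict(return_std=True) idea from the last sentence, assuming the spread of the per-tree predictions is an acceptable uncertainty estimate; ForestWithStd is a hypothetical name, not scikit-learn API.

```python
# Hypothetical wrapper: expose the per-tree prediction spread as a std
# estimate, which is what sample_posterior=True needs from its estimator.
import numpy as np
from sklearn.ensemble import RandomForestRegressor


class ForestWithStd(RandomForestRegressor):
    def predict(self, X, return_std=False):
        if not return_std:
            return super().predict(X)
        # Stack each fitted tree's predictions; the mean matches the
        # forest's usual prediction, the std serves as the uncertainty.
        per_tree = np.stack([tree.predict(X) for tree in self.estimators_])
        return per_tree.mean(axis=0), per_tree.std(axis=0)
```

With this, IterativeImputer(estimator=ForestWithStd(), sample_posterior=True) would at least run; whether the tree spread is a good stand-in for a posterior is a separate question.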
NMF? Why? Sorry, I don't follow. |
It'll be way slower with RF and perhaps not better! Is it identical to NMF? Or just similar? I'd love to see an example. |
I think NMF was a typo. Low rank matrix factorization was what was meant.
It would be slower, but if we want a fast imputer, we should code a low rank matrix factorisation.
|
Sorry, not NMF, just MF. Jotted this down in a spare moment so as to not lose your comment on the defaults, Gaël. |
What's the effective rank when doing IterativeImputer? Is there an equivalence proof somewhere, or empirical results I can look at showing this equivalence?
|
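For readers wanting to probe the claimed equivalence, here is a rough sketch of the low-rank matrix-factorisation imputer under discussion, in the style of iterated truncated SVD; mf_impute, the rank, and the iteration count are illustrative choices, not an agreed design.

```python
# Fill missing cells by repeatedly projecting onto a rank-k approximation.
import numpy as np


def mf_impute(X, rank=2, n_iter=100):
    missing = np.isnan(X)
    # Start from column means, then iterate: SVD the current completion,
    # truncate to the requested rank, and overwrite only the missing cells.
    X_filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X_filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X_filled[missing] = low_rank[missing]
    return X_filled
```

Comparing its held-out reconstruction error against IterativeImputer with a ridge-type estimator, across a few ranks, would be one way to get the empirical evidence asked for here.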
We should consider before 0.22 what a better default estimator is for IterativeImputer. |
What's wrong with BayesianRidge? |
One issue was speed (without sample_posterior); the other was the concern here that it is an inefficient way to effectively do MF, if I understand correctly.
|
I would love to see some experiments or simulations or theory showing the MF equivalence...
|
> I strongly believe a linear estimator is the best default.

If we want linear conditional imputation, we should be using low-rank matrix factorization.

MissForest, the R package implementing conditional imputation based on forests, is one of the packages that works best as a black-box method. I think that this is what the IterativeImputer should be aiming to mimic.
|
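Concretely, the MissForest-style setup Gaël describes can already be expressed through the existing estimator parameter; in this sketch the data is synthetic and the hyperparameters are guesses, not tuned values.

```python
# MissForest-style imputation: IterativeImputer with a forest estimator.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.RandomState(0)
X = rng.randn(100, 4)
X[rng.rand(*X.shape) < 0.25] = np.nan  # knock out 25% of the entries

forest_imputer = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_completed = forest_imputer.fit_transform(X)
```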
I would appreciate seeing some performance numbers comparing BayesianRidge vs RandomForest vs ExtraTrees as estimators, in terms of both quality and running time (see the sketches after this comment). Speed is pretty important for defaults, and we'd want to know what kind of gains are possible for the extra time paid. Do you two agree?

In case we decide to keep linear as the default, one option would be to run RidgeCV for one round and freeze the regularisation params chosen for each column. That should stop the odd jumpiness we observed.

I personally think a more important addition for 0.22 is to add a classifier for the categorical columns.
|
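A minimal version of the benchmark asked for above might look like the following; the dataset, missingness rate, and forest sizes are arbitrary choices, and a real comparison would sweep several datasets and missingness mechanisms.

```python
# Quality-vs-speed sketch: hide a fraction of known entries, impute, then
# score RMSE on the hidden cells and record wall-clock time.
import time

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

X_full, _ = load_diabetes(return_X_y=True)
rng = np.random.RandomState(0)
mask = rng.rand(*X_full.shape) < 0.25  # hide 25% of the entries
X_missing = np.where(mask, np.nan, X_full)

for name, est in [
    ("BayesianRidge", BayesianRidge()),
    ("RandomForest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("ExtraTrees", ExtraTreesRegressor(n_estimators=50, random_state=0)),
]:
    imputer = IterativeImputer(estimator=est, max_iter=10, random_state=0)
    tic = time.perf_counter()
    X_imputed = imputer.fit_transform(X_missing)
    elapsed = time.perf_counter() - tic
    rmse = np.sqrt(np.mean((X_imputed[mask] - X_full[mask]) ** 2))
    print(f"{name:15s} rmse={rmse:.4f} time={elapsed:.1f}s")
```

And a rough sketch of the RidgeCV freeze idea. Note that IterativeImputer clones a single estimator for every column, so a single frozen alpha (here the median of the per-column choices) is used as an approximation; truly freezing a different alpha per column would need a custom estimator.

```python
# "Run RidgeCV for one round, then freeze the chosen regularisation."
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import Ridge, RidgeCV

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
X[rng.rand(*X.shape) < 0.2] = np.nan

# Round 1: let RidgeCV pick a regularisation strength per column, then
# read the fitted estimators back out of imputation_sequence_.
probe = IterativeImputer(estimator=RidgeCV(alphas=np.logspace(-3, 3, 13)),
                         max_iter=1, random_state=0).fit(X)
alphas = [triplet.estimator.alpha_ for triplet in probe.imputation_sequence_]

# Freeze: reuse one representative alpha for all remaining rounds.
frozen = IterativeImputer(estimator=Ridge(alpha=float(np.median(alphas))),
                          max_iter=10, random_state=0)
X_imputed = frozen.fit_transform(X)
```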