
Add strategy="random" to SimpleImputer #11209

Open

ogrisel opened this issue Jun 6, 2018 · 15 comments
Labels: module:impute, Needs Decision - Include Feature, New Feature

Comments

@ogrisel (Member) commented Jun 6, 2018

The purpose would be to provide an unbiased, stochastic, univariate imputer. The imputer would replace missing values with values sampled uniformly at random from the non-missing values of the same column.

This would require adding an additional random_state constructor parameter to the class.

As with #11208, this strategy should work on both numerical and non-numerical dtypes.
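
For concreteness, a minimal sketch of this behaviour for a single numeric column (the helper name and signature are only illustrative, not a proposed API):

import numpy as np

def impute_column_random(col, random_state=None):
    # Fill NaNs in a 1-D numeric column by sampling, uniformly at random,
    # from the column's observed (non-missing) values.
    rng = np.random.RandomState(random_state)
    col = np.asarray(col, dtype=float)
    missing = np.isnan(col)
    out = col.copy()
    out[missing] = rng.choice(col[~missing], size=missing.sum(), replace=True)
    return out

# The two NaNs below are replaced by values drawn from {1.0, 3.0, 5.0}.
impute_column_random(np.array([1.0, np.nan, 3.0, np.nan, 5.0]), random_state=0)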

@ogrisel (Member, Author) commented Jun 6, 2018

Maybe this should be named strategy="sample".

@jnothman (Member)

This also requires not using statistics_ at transform time, so I suspect it should be a different class.

@jnothman added the New Feature, Easy, and help wanted labels Jun 15, 2018
@jeremiedbb (Member)

Since I'm currently working on the SimpleImputer, I'd like to help on that.
Right now, statistics_ stores one statistic per column, since the implemented strategies only need to impute a constant value in each column. This strategy is therefore quite different from the other ones. However, it's still a simple strategy compared to the MICEImputer.
We can make a new class, say SampleImputer (maybe too close to SimpleImputer?).
Another solution could be to store the induced probability distribution in statistics_ and add special treatment for the "sample" strategy at transform time.
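
For illustration, a rough, numeric-only sketch of the separate-class option, where fit stores the observed values of each column (playing the role of statistics_) and transform samples from them; the class name and all details are just placeholders:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_random_state

class SampleImputer(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # keep the non-missing values of each column for later sampling
        self.observed_ = [col[~np.isnan(col)] for col in X.T]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float).copy()
        rng = check_random_state(self.random_state)
        for j, observed in enumerate(self.observed_):
            missing = np.isnan(X[:, j])
            # each missing entry gets an independent draw from the column's
            # observed values
            X[missing, j] = rng.choice(observed, size=missing.sum(), replace=True)
        return X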

@jnothman (Member)

I'd prefer SamplingImputer to SampleImputer!

@jnothman (Member)

@RianneSchouten also volunteered to contribute this: #8478 (comment)

@jnothman (Member)

Ideas for tests. For any single-column training data X with non-zero nanvar, the following should hold:

import numpy as np
import pytest
from numpy.testing import assert_array_equal

# _get_mask is scikit-learn's internal helper that flags entries equal to
# missing_values (handling np.nan correctly); X and imputer come from the
# surrounding test setup.
values = np.unique(X)
values = values[~_get_mask(values, missing_values)]

Xts = []
for i in range(100):
    Xt = imputer.set_params(random_state=i).fit_transform(X)
    # imputed entries must come from the values already observed in the column
    assert_array_equal(values, np.unique(Xt))
    Xts.append(Xt)

# different seeds should give different imputations...
assert not np.allclose(Xts[-1], Xts[-2])  # or more strictly, some indices should fail np.allclose
# ...while on average the imputed data preserves the observed mean
assert np.mean(np.concatenate(Xts)) == pytest.approx(np.nanmean(X))

Behaviour at test time (i.e. transform) might be harder to define because we don't have many stochastic transformers. If a random seed is fixed, should every call to transform with the same data produce the same imputation? This would imply that fit_transform and transform are likely to draw the same samples. I'm not sure how problematic that is.
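
To make the two options concrete, a small, purely illustrative comparison of re-seeding on every call versus reusing one RandomState instance:

import numpy as np
from sklearn.utils import check_random_state

observed = np.array([1.0, 3.0, 5.0])

# Re-seeding on each call: identical draws every time, so repeated
# transform calls (and fit_transform vs transform) would agree.
assert np.array_equal(
    check_random_state(42).choice(observed, size=3),
    check_random_state(42).choice(observed, size=3),
)

# One persistent RandomState: consecutive calls generally draw
# different samples.
rng = check_random_state(42)
first, second = rng.choice(observed, size=3), rng.choice(observed, size=3)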

@PGryllos (Contributor) commented Jun 26, 2018

Is someone working actively on this? Otherwise can I take a look?

@jeremiedbb (Member)

I thought @RianneSchouten was working on it, so I decided to step back, but I already have some code for this. Now that it seems she's not actively working on it, I'll make a PR soon.

@RianneSchouten

I don't mind whoever does it. I am quite busy, so if you are faster at making a PR, don't hesitate.

@jeremiedbb (Member)

I made a PR to add this feature. Since I was not sure about new strategy vs new class, I made both. Tell me which one you think is best.

@amueller (Member)

I agree on adding a different class and using binning by default.

@mokeddembillel

Hi, is anyone still working on this issue? If not, I'd like to help.

@thomasjpfan (Member)

There is discussion surrounding this issue in #11368.

At a high level, it looks like the solution needs benchmarking to see how it performs when compared to other imputing strategies.

@mokeddembillel

@thomasjpfan Thanks for letting me know, I'll try to catch up.

@cmarmo removed the Easy label Apr 23, 2021
@cmarmo added the Needs Decision - Include Feature and module:impute labels Jan 15, 2022
@Jose-Verdu-Diaz

Any updates on this? I would love to use random imputation as a baseline for comparing the performance of other imputation methods. This would be equivalent to DummyClassifier with strategy="uniform".

I don't think this is trivial, as the imputer should ideally:

  • Support both categorical and numerical
  • Support both continuous and discrete numerical variables (this could be fixed after imputation by the end-user)
  • Support randomly sampling from a list of user-provided values
  • Support sampling from a uniform distribution with a min/max value provided by the user

Probably the easiest and fastest way to implement this and provide minimum functionality would be to sample from a uniform distribution whose min/max values are the column-wise min/max of the non-missing values. The end-user can then round the values to discretize them, if needed.
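
A rough sketch of that minimal version, assuming numeric input with NaN as the missing marker (the function name and details are only illustrative):

import numpy as np

def impute_uniform_range(X, random_state=None):
    # Replace each NaN with a draw from a uniform distribution between the
    # column's observed minimum and maximum (column-wise).
    rng = np.random.RandomState(random_state)
    X = np.asarray(X, dtype=float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        missing = np.isnan(col)
        col[missing] = rng.uniform(np.nanmin(col), np.nanmax(col), size=missing.sum())
    return X

X_imputed = impute_uniform_range(
    np.array([[1.0, np.nan], [np.nan, 4.0], [3.0, 6.0]]), random_state=0
)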
