
Add strategy="random" to SimpleImputer #11209

Open

ogrisel opened this issue Jun 6, 2018 · 15 comments
Labels: module:impute, Needs Decision - Include Feature, New Feature

Comments

@ogrisel (Member) commented Jun 6, 2018

The purpose would be to provide an unbiased, stochastic, univariate imputer. The imputer would replace missing values with values sampled uniformly at random from the non-missing values of the same column.

This would require adding an additional random_state constructor parameter to the class.

As with #11208, this strategy should work on both numerical and non-numerical dtypes.
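
For concreteness, a minimal sketch of this behaviour for a single numeric column (the helper name and signature are only illustrative, not a proposed API):

import numpy as np

def impute_column_random(col, random_state=None):
    # Fill NaNs in a 1-D numeric column by sampling, uniformly at random,
    # from the column's observed (non-missing) values.
    rng = np.random.RandomState(random_state)
    col = np.asarray(col, dtype=float)
    missing = np.isnan(col)
    out = col.copy()
    out[missing] = rng.choice(col[~missing], size=missing.sum(), replace=True)
    return out

# The two NaNs below are replaced by values drawn from {1.0, 3.0, 5.0}.
impute_column_random(np.array([1.0, np.nan, 3.0, np.nan, 5.0]), random_state=0)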

@ogrisel (Member, Author) commented Jun 6, 2018

Maybe this should be named strategy="sample".

@jnothman (Member)

This also requires not using statistics_ at transform time, so I suspect it should be a different class.

@jnothman added the New Feature, Easy, and help wanted labels Jun 15, 2018
@jeremiedbb (Member)

Since I'm currently working on the SimpleImputer, I'd like to help on that.
Right now, statistics_ stores one statistic per column, since the implemented strategies only need to impute a constant value in each column. This strategy is therefore quite different from the other ones. However, it's still a simple strategy compared to the MICEImputer.
We can make a new class, say SampleImputer (maybe too close to SimpleImputer?).
Another solution could be to store the induced probability distribution in statistics_ and add special treatment for the "sample" strategy at transform time.
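
For illustration, a rough, numeric-only sketch of the separate-class option, where fit stores the observed values of each column (playing the role of statistics_) and transform samples from them; the class name and all details are just placeholders:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_random_state

class SampleImputer(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # keep the non-missing values of each column for later sampling
        self.observed_ = [col[~np.isnan(col)] for col in X.T]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float).copy()
        rng = check_random_state(self.random_state)
        for j, observed in enumerate(self.observed_):
            missing = np.isnan(X[:, j])
            # each missing entry gets an independent draw from the column's
            # observed values
            X[missing, j] = rng.choice(observed, size=missing.sum(), replace=True)
        return X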

@jnothman (Member)

I'd prefer SamplingImputer to SampleImputer!

@jnothman (Member)

@RianneSchouten also volunteered to contribute this: #8478 (comment)

@jnothman (Member)

Ideas for tests. For any single-column training data X with non-zero nanvar, the following should hold:

import numpy as np
import pytest
from numpy.testing import assert_array_equal

# _get_mask is scikit-learn's internal helper that flags entries equal to
# missing_values (handling np.nan correctly); X and imputer come from the
# surrounding test setup.
values = np.unique(X)
values = values[~_get_mask(values, missing_values)]

Xts = []
for i in range(100):
    Xt = imputer.set_params(random_state=i).fit_transform(X)
    # imputed entries must come from the values already observed in the column
    assert_array_equal(values, np.unique(Xt))
    Xts.append(Xt)

# different seeds should give different imputations...
assert not np.allclose(Xts[-1], Xts[-2])  # or more strictly, some indices should fail np.allclose
# ...while on average the imputed data preserves the observed mean
assert np.mean(np.concatenate(Xts)) == pytest.approx(np.nanmean(X))

Behaviour at test time (i.e. transform) might be harder to define because we don't have many stochastic transformers. If a random seed is fixed, should every call to transform with the same data produce the same imputation? This would imply that fit_transform and transform are likely to draw the same samples. I'm not sure how problematic that is.
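
To make the two options concrete, a small, purely illustrative comparison of re-seeding on every call versus reusing one RandomState instance:

import numpy as np
from sklearn.utils import check_random_state

observed = np.array([1.0, 3.0, 5.0])

# Re-seeding on each call: identical draws every time, so repeated
# transform calls (and fit_transform vs transform) would agree.
assert np.array_equal(
    check_random_state(42).choice(observed, size=3),
    check_random_state(42).choice(observed, size=3),
)

# One persistent RandomState: consecutive calls generally draw
# different samples.
rng = check_random_state(42)
first, second = rng.choice(observed, size=3), rng.choice(observed, size=3)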

@PGryllos (Contributor) commented Jun 26, 2018

Is someone working actively on this? Otherwise can I take a look?

@jeremiedbb (Member)

I thought @RianneSchouten was working on it, so I decided to step back, but I already have some code for this. Now that it seems she's not actively working on it, I'll make a PR soon.

@RianneSchouten

I don't mind whoever does it. I am quite busy, so if you are faster at making a PR, don't hesitate.

@jeremiedbb (Member)

I made a PR to add this feature. Since I was not sure about new strategy vs new class, I made both. Tell me which one you think is best.

@amueller (Member)

I agree on adding a different class and using binning by default.

@mokeddembillel

Hi, is anyone still working on this issue? If not, I'd like to help.

@thomasjpfan (Member)

There is discussion surrounding this issue in #11368.

At a high level, it looks like the solution needs benchmarking to see how it performs when compared to other imputing strategies.

@mokeddembillel

@thomasjpfan Thanks for letting me know, I'll try to catch up.

@cmarmo removed the Easy label Apr 23, 2021
@cmarmo added the Needs Decision - Include Feature and module:impute labels Jan 15, 2022
@Jose-Verdu-Diaz

Any updates on this? I would love to use random imputation as a baseline for comparing the performance of other imputation methods. This would be equivalent to DummyClassifier with strategy="uniform".

I don't think this is trivial, as the imputer should ideally:

  • Support both categorical and numerical
  • Support both continuous and discrete numerical variables (this could be fixed after imputation by the end-user)
  • Support randomly sampling from a list of user-provided values
  • Support sampling from a uniform distribution with a min/max value provided by the user

Probably the easiest and fastest way to implement this and provide minimum functionality would be to sample from a uniform distribution whose min/max values are the column-wise min/max of the non-missing values. The end-user can then round the values to discretize them, if needed.
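
A rough sketch of that minimal version, assuming numeric input with NaN as the missing marker (the function name and details are only illustrative):

import numpy as np

def impute_uniform_range(X, random_state=None):
    # Replace each NaN with a draw from a uniform distribution between the
    # column's observed minimum and maximum (column-wise).
    rng = np.random.RandomState(random_state)
    X = np.asarray(X, dtype=float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        missing = np.isnan(col)
        col[missing] = rng.uniform(np.nanmin(col), np.nanmax(col), size=missing.sum())
    return X

X_imputed = impute_uniform_range(
    np.array([[1.0, np.nan], [np.nan, 4.0], [3.0, 6.0]]), random_state=0
)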
