
[MRG] sample from a truncated normal instead of clipping samples from a normal #12177


Merged

Conversation

benlawson
Contributor

Reference Issues/PRs

Related to "Merge iterativeimputer branch into master" #11977

What does this implement/fix? Explain your changes.

When sampling from the posterior and boundary values are given, the current implementation clips values that are sampled from a normal distribution. This can lead to undesired oversampling of the boundary values. For example, if our boundaries were [0, 2] and we have a mean of 0 and a standard deviation of 1, the difference between clipping samples from a normal and sampling from a truncated normal is shown here:

from scipy.stats import norm, truncnorm
import matplotlib.pyplot as plt
import numpy as np

norm_dist = norm(loc=0, scale=1)
trunc_dist = truncnorm(a=0, b=2, loc=0, scale=1)
truncs = trunc_dist.rvs(10000)
norms = norm_dist.rvs(10000)
norms = np.clip(norms, 0, 2)
plt.hist(norms, histtype='step')
plt.hist(truncs, histtype='step')
plt.legend(["clipped normal", "truncated normal"])
plt.show()

[figure "example": overlaid histograms of the clipped-normal and truncated-normal samples]

When sampling from the posterior, this PR samples from a truncated normal distribution instead of clipping values that have been sampled from a normal distribution.
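For reference, here is a minimal sketch of the two approaches (not the code in this PR; mu, sigma, min_value and max_value are illustrative names for the posterior mean, posterior standard deviation and the user-given bounds):

from scipy.stats import truncnorm
import numpy as np

rng = np.random.RandomState(0)
mu, sigma = 0.0, 1.0             # posterior parameters for one missing entry
min_value, max_value = 0.0, 2.0  # user-given bounds

# before: draw from a normal, then clip into the bounds
# (this piles probability mass onto the boundary values)
clipped = np.clip(rng.normal(loc=mu, scale=sigma), min_value, max_value)

# after: draw directly from a normal truncated at the bounds
# (truncnorm takes the bounds in standard-deviation units relative to loc)
a, b = (min_value - mu) / sigma, (max_value - mu) / sigma
truncated = truncnorm(a=a, b=b, loc=mu, scale=sigma).rvs(random_state=rng)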

Any other comments?

To impute values within boundaries, the MICE R package appears to clip values [1], while the mi R package samples from a truncated normal [2] within the user-given range. Open to discussing which approach is best.

[1] page 149, https://cran.r-project.org/web/packages/mice/mice.pdf
[2] page 21, https://cran.r-project.org/web/packages/mi/mi.pdf

@jnothman
Member

This looks nice :)

I assume there's no simple way to write a non-regression test :\ ?

@ogrisel
Member

ogrisel commented Sep 27, 2018

Some tests fail with older versions of scipy. I think we might need a backport in sklearn.utils.fixes:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/fixes.py
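A version-gated shim is the usual pattern there; a minimal sketch, assuming the fallback lives in a hypothetical vendored module (the name _scipy_truncnorm_backport below is illustrative):

from distutils.version import LooseVersion

import scipy

if LooseVersion(scipy.__version__) >= LooseVersion('0.17'):
    from scipy.stats import truncnorm  # noqa
else:
    # fall back to a vendored copy of scipy's implementation (hypothetical module)
    from sklearn.utils._scipy_truncnorm_backport import truncnorm  # noqa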

@benlawson benlawson changed the title sample from a truncated normal instead of clipping samples from a normal [WIP] sample from a truncated normal instead of clipping samples from a normal Sep 27, 2018
@sklearn-lgtm

This pull request introduces 2 alerts when merging 21d7238 into a4f2a89 - view on LGTM.com

new alerts:

  • 2 for Unused import

Comment posted by LGTM.com

@sklearn-lgtm

This pull request introduces 4 alerts when merging 451b80d into a4f2a89 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Missing call to __init__ during object initialization
  • 1 for Module-level cyclic import

Comment posted by LGTM.com

@sklearn-lgtm

This pull request introduces 5 alerts when merging f5b5d94 into a4f2a89 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Missing call to __init__ during object initialization
  • 1 for Module-level cyclic import
  • 1 for First argument to super() is not enclosing class

Comment posted by LGTM.com

@sklearn-lgtm

This pull request introduces 3 alerts when merging 97e2d37 into a4f2a89 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Module-level cyclic import

Comment posted by LGTM.com

@sklearn-lgtm

This pull request introduces 3 alerts when merging bf0e49f into a4f2a89 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Module-level cyclic import

Comment posted by LGTM.com

@@ -591,6 +592,28 @@ def test_iterative_imputer_clip():
assert_allclose(Xt[X != 0], X[X != 0])


def test_iterative_imputer_normal_posterior():
rng = np.random.RandomState(0)
Member

Add a comment here on what, intuitively, this test ensures

# generate multiple imputations for the single missing value
imputations = np.array([imputer.transform(X)[0][0] for _ in range(1000)])
mu, sigma = imputations.mean(), imputations.std()
ks_statistic, p_value = kstest((imputations-mu)/sigma, 'norm')
Member

Please put spaces around binary operators

@sklearn-lgtm

This pull request introduces 3 alerts when merging 2edb84c into a4f2a89 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Module-level cyclic import

Comment posted by LGTM.com

@benlawson
Contributor Author

For the backport, I just took the relevant scipy code and added it to a new file, sklearn/utils/_scipy_truncnorm_backport.py.

  1. Is this the proper way to do backports?
  2. If so, should I add the corresponding tests from scipy for the code-coverage check?

@sklearn-lgtm

This pull request introduces 3 alerts when merging 00a4dae into a4f2a89 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Module-level cyclic import

Comment posted by LGTM.com

@ogrisel
Member

ogrisel commented Oct 1, 2018

Actually, based on the consensus emerging from the discussion in #12184, I don't think we need to backport: scikit-learn 0.21 will require scipy >= 0.17.0.

So please remove the backport and instead skip the tests that fail because of the lack of scipy.stats.truncnorm:

import pytest
import scipy.stats

# TODO remove the skipif marker as soon as scipy < 0.17 support is dropped
@pytest.mark.skipif(not hasattr(scipy.stats, 'truncnorm'),
                    reason='need scipy.stats.truncnorm')
def test_something():
    ...

@benlawson
Contributor Author

Cool, sounds good. I used the following to skip the tests, since scipy 0.14 has truncnorm but it doesn't accept the random_state parameter:

def test_something():
    pytest.importorskip("scipy", minversion="0.17.0")
    ...

@benlawson benlawson changed the title [WIP] sample from a truncated normal instead of clipping samples from a normal [MRG] sample from a truncated normal instead of clipping samples from a normal Oct 2, 2018
@jnothman jnothman left a comment (Member)

Please add your name to the credits for IterativeImputer in doc/whats_new/v0.21.rst

Multiple modules
................

Changes to estimator checks
---------------------------

These changes mostly affect library developers.
These changes mo:tabstly affect library developers.
Member

?

Contributor Author

whoops, sorry about that!

@ogrisel ogrisel left a comment (Member)

Here are some comments.

The new test is nice, but it doesn't really check that truncated normal != clipped normal, right? But I guess this is fine. I am not sure what we could do to test it better.

def test_iterative_imputer_normal_posterior():
# test that the values that are imputed using `sample_posterior=True`
# with boundaries (`min_value` and `max_value` are not None) are drawn
# from a distribution that looks gaussian via the Kolmogorov Smirnov test
Member

"that does not look Gaussian", right?

If the KS test p-value is larger than 0.1 the null hypothesis that the posterior samples are sampled from a standard normal distribution is rejected.

And this is expected because of the truncation by min_value and max_value that cut the tails of the posterior distribution.

Contributor Author

Perhaps it should read "that looks roughly Gaussian."

We would want to fail to reject the null hypothesis. We don't want the truncation to make the posterior no longer look roughly Gaussian.

Member

Oh alright sorry, it's just me having trouble with double negation reasoning.

# we want to fail to reject null hypothesis
# null hypothesis: distributions are the same
assert ks_statistic < 0.15 or p_value > 0.1, \
"The posterior does appear to be normal"
Member

Where does the 0.15 threshold for the KS-statistic come from? Isn't p_value > 0.1 enough?

Contributor Author

I made this assert statement based on my understanding of this line from scipy.stats.ks_2samp docs: "If the K-S statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same."

Empirically, a slightly higher threshold might be needed (0.2). The p-values don't seem to differ much between the two branches (this branch / the iterativeimputer branch). The threshold change is based on a mistake I made generating the test data: the values of variable X were supposed to be normally distributed and centered at 0, but I mistakenly used .random_sample, which is uniformly distributed over [0, 1), so it was unlikely that min_value=0 would do much.

[figures "ksstats" and "pvalues": boxplots of the KS statistics and p-values over 100 seeds for the two branches]

code used

experiment
from tqdm import tqdm
import joblib

import numpy as np
from scipy.stats import kstest
from sklearn.impute import IterativeImputer

stats, ps = [], []
for seed in tqdm(range(0, 100)):
    #  test that the values that are imputed using `sample_posterior=True`
    #  with boundaries (`min_value` and `max_value` are not None) are drawn
    #  from a distribution that looks gaussian via the Kolmogorov Smirnov test
    rng = np.random.RandomState(seed)

    X = rng.normal(size=(5, 5))
    X[0][0] = np.nan

    imputer = IterativeImputer(min_value=0,
                               max_value=0.5,
                               sample_posterior=True,
                               random_state=rng)

    imputer.fit_transform(X)
    # generate multiple imputations for the single missing value
    imputations = np.array([imputer.transform(X)[0][0] for _ in range(1000)])
    mu, sigma = imputations.mean(), imputations.std()
    if sigma == 0:
        sigma += 1e-12
    ks_statistic, p_value = kstest((imputations - mu) / sigma, 'norm')
    stats.append(ks_statistic)
    ps.append(p_value)
joblib.dump((stats, ps), 'master_branch.joblib')
plotting
import joblib
import numpy as np
import matplotlib.pyplot as plt

clip_stats, clip_ps = joblib.load("./master_branch.joblib")
truc_stats, truc_ps = joblib.load("./new_branch.joblib")

clip_ps = np.array(clip_ps)[~np.isnan(clip_ps)]
clip_stats = np.array(clip_stats)[~np.isnan(clip_stats)]
truc_ps = np.array(truc_ps)[~np.isnan(truc_ps)]
truc_stats = np.array(truc_stats)[~np.isnan(truc_stats)]

plt.boxplot([truc_stats, clip_stats], labels=["Truncated normal", "Clipped normal"])
plt.hlines(0.15, 0, 3, linestyles='dotted', colors='r')
plt.title("Distribution of KS stats")
plt.savefig("ksstats.png")
plt.close()
plt.boxplot([truc_ps, clip_ps], labels=["Truncated normal", "Clipped normal"])
plt.title("Distribution of p-values")
plt.savefig("pvalues.png")

Member

Thank you very much for the details, this is very helpful.

@@ -591,6 +592,32 @@ def test_iterative_imputer_clip():
assert_allclose(Xt[X != 0], X[X != 0])


def test_iterative_imputer_normal_posterior():
Member

I would rename this to test_iterative_imputer_truncated_normal_posterior

loc=mus[good_sigmas],
scale=sigmas[good_sigmas])
imputed_values[good_sigmas] = truncated_normal.rvs(
random_state=self.random_state_)
Member

nitpick: this indentation does not seem to follow PEP8. I wonder why it wasn't caught by our CI.

The following should work:

            imputed_values[good_sigmas] = truncated_normal.rvs(
                random_state=self.random_state_)

@@ -48,6 +48,10 @@ Support for Python 3.4 and below has been officially dropped.
function of other features in a round-robin fashion. :issue:`8478` by
:user:`Sergey Feldman <sergeyf>`.

- |Enhancement| :class:`impute.IterativeImputer` now samples from a truncated normal
distrubtion instead of a clipped normal distribution when ``sample_posterior=True``.
Member

typo: distribution

@ogrisel
Member

ogrisel commented Oct 5, 2018

LGTM, merging.

@ogrisel ogrisel merged commit 09a9a21 into scikit-learn:iterativeimputer Oct 5, 2018