[MRG] sample from a truncated normal instead of clipping samples from a normal #12177
Conversation
This looks nice :) I assume there's no simple way to write a non-regression test :\ ?
Some tests fail with older versions of scipy. I think we might need a backport in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/fixes.py
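A minimal sketch of what such a version-gated shim in fixes.py might look like (the vendored module name _truncnorm_backport and the exact version check are assumptions, not sklearn's actual code):

# hypothetical shim: prefer scipy's truncnorm, fall back to a vendored copy
import scipy
from distutils.version import LooseVersion

if LooseVersion(scipy.__version__) < LooseVersion('0.17'):
    from ._truncnorm_backport import truncnorm  # hypothetical vendored module
else:
    from scipy.stats import truncnorm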
LGTM.com bot: This pull request introduces 2 alerts when merging 21d7238 into a4f2a89.
LGTM.com bot: This pull request introduces 4 alerts when merging 451b80d into a4f2a89.
LGTM.com bot: This pull request introduces 5 alerts when merging f5b5d94 into a4f2a89.
LGTM.com bot: This pull request introduces 3 alerts when merging 97e2d37 into a4f2a89.
…andom_state' parameter in freeze
LGTM.com bot: This pull request introduces 3 alerts when merging bf0e49f into a4f2a89.
@@ -591,6 +592,28 @@ def test_iterative_imputer_clip():
    assert_allclose(Xt[X != 0], X[X != 0])


def test_iterative_imputer_normal_posterior():
    rng = np.random.RandomState(0)
Add a comment here on what, intuitively, this test ensures
sklearn/tests/test_impute.py
Outdated
# generate multiple imputations for the single missing value
imputations = np.array([imputer.transform(X)[0][0] for _ in range(1000)])
mu, sigma = imputations.mean(), imputations.std()
ks_statistic, p_value = kstest((imputations-mu)/sigma, 'norm')
Please put spaces around binary operators
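Applied, the flagged line would read:

ks_statistic, p_value = kstest((imputations - mu) / sigma, 'norm')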
LGTM.com bot: This pull request introduces 3 alerts when merging 2edb84c into a4f2a89.
for the backport, I just took the relevant scipy code and added it to a new file
LGTM.com bot: This pull request introduces 3 alerts when merging 00a4dae into a4f2a89.
Actually, based on the consensus emerging from the discussion in #12184, I don't think we need to backport: scikit-learn 0.21 will require scipy >= 0.17.0. So please remove the backport and instead skip the tests that fail because of the lack of scipy.stats.truncnorm:

# TODO remove the skipif marker as soon as scipy < 0.17 support is dropped
@pytest.mark.skipif(not hasattr(scipy.stats, 'truncnorm'),
                    reason='need scipy.stats.truncnorm')
def test_something():
    ...
…and remove non-ascii character in description for python2 support
cool, sounds good. I used the following to skip tests, as scipy 0.14 has truncnorm but it doesn't accept random_state:

def test_something():
    pytest.importorskip("scipy", minversion="0.17.0")
    ...
Please add your name to the credits for IterativeImputer in doc/whats_new/v0.21.rst
doc/whats_new/v0.21.rst
Outdated
Multiple modules
................


Changes to estimator checks
---------------------------


These changes mostly affect library developers.
These changes mo:tabstly affect library developers.
?
whoops, sorry about that!
Here are some comments.
The new test is nice, but it doesn't really check that truncated normal != clipped normal, right? But I guess this is fine. I am not sure what we could do to test better.
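One possibility, sketched here rather than taken from the PR (the 1% tolerance is an assumption): clipping piles probability mass exactly on the boundaries, so a sharper test could assert that repeated posterior draws almost never coincide with min_value or max_value:

# hypothetical sharper check: truncated-normal draws should almost never
# land exactly on the boundaries, whereas clipped draws often do
imputations = np.array([imputer.transform(X)[0][0] for _ in range(1000)])
assert (imputations == imputer.min_value).mean() < 0.01
assert (imputations == imputer.max_value).mean() < 0.01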
def test_iterative_imputer_normal_posterior():
    # test that the values that are imputed using `sample_posterior=True`
    # with boundaries (`min_value` and `max_value` are not None) are drawn
    # from a distribution that looks gaussian via the Kolmogorov Smirnov test
"that does not look Gaussian", right?
If the KS test p-value is larger than 0.1 the null hypothesis that the posterior samples are sampled from a standard normal distribution is rejected.
And this is expected because of the truncation by min_value
and max_value
that cut the tails of the posterior distribution.
Perhaps it should read "that looks roughly Gaussian."
We would want to fail to reject the null hypothesis. We don't want the truncation to make the posterior no longer roughly Gaussian-looking.
Oh alright sorry, it's just me having trouble with double negation reasoning.
# we want to fail to reject null hypothesis
# null hypothesis: distributions are the same
assert ks_statistic < 0.15 or p_value > 0.1, \
    "The posterior does not appear to be normal"
Where does the 0.15 threshold for the KS-statistic come from? Isn't p_value > 0.1 enough?
I made this assert statement based on my understanding of this line from the scipy.stats.ks_2samp docs: "If the K-S statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same."
Empirically, a slightly higher threshold might be needed (0.2). The p-values don't seem to differ much between the two branches (this branch/iterativeimputer branch). The threshold change is based on a mistake I made generating the test data: the values of variable X were supposed to be normally distributed, centered at 0, but I mistakenly used .random_sample, which is uniformly distributed on [0, 1), so it was unlikely min_value=0 would do much.
Code used:
Experiment:
from tqdm import tqdm
import joblib
import numpy as np
from scipy.stats import kstest
from sklearn.impute import IterativeImputer

stats, ps = [], []
for seed in tqdm(range(0, 100)):
    # test that the values that are imputed using `sample_posterior=True`
    # with boundaries (`min_value` and `max_value` are not None) are drawn
    # from a distribution that looks gaussian via the Kolmogorov Smirnov test
    rng = np.random.RandomState(seed)
    X = rng.normal(size=(5, 5))
    X[0][0] = np.nan
    imputer = IterativeImputer(min_value=0,
                               max_value=0.5,
                               sample_posterior=True,
                               random_state=rng)
    imputer.fit_transform(X)
    # generate multiple imputations for the single missing value
    imputations = np.array([imputer.transform(X)[0][0] for _ in range(1000)])
    mu, sigma = imputations.mean(), imputations.std()
    if sigma == 0:
        sigma += 1e-12
    ks_statistic, p_value = kstest((imputations - mu) / sigma, 'norm')
    stats.append(ks_statistic)
    ps.append(p_value)
joblib.dump((stats, ps), 'master_branch.joblib')
Plotting:
import joblib
import numpy as np
import matplotlib.pyplot as plt
clip_stats, clip_ps = joblib.load("./master_branch.joblib")
truc_stats, truc_ps = joblib.load("./new_branch.joblib")
clip_ps = np.array(clip_ps)[~np.isnan(clip_ps)]
clip_stats = np.array(clip_stats)[~np.isnan(clip_stats)]
truc_ps = np.array(truc_ps)[~np.isnan(truc_ps)]
truc_stats = np.array(truc_stats)[~np.isnan(truc_stats)]
plt.boxplot([truc_stats, clip_stats], labels=["Truncated normal", "Clipped normal"])
plt.hlines(0.15, 0, 3, linestyles='dotted', colors='r')
plt.title("Distribution of KS stats")
plt.savefig("ksstats.png")
plt.close()
plt.boxplot([truc_ps, clip_ps], labels=["Truncated normal", "Clipped normal"])
plt.title("Distribution of p-values")
plt.savefig("pvalues.png")
Thank you very much for the details, this is very helpful.
sklearn/tests/test_impute.py
Outdated
@@ -591,6 +592,32 @@ def test_iterative_imputer_clip():
    assert_allclose(Xt[X != 0], X[X != 0])

def test_iterative_imputer_normal_posterior(): |
I would rename this to test_iterative_imputer_truncated_normal_posterior
sklearn/impute.py
Outdated
loc=mus[good_sigmas],
scale=sigmas[good_sigmas])
imputed_values[good_sigmas] = truncated_normal.rvs(
random_state=self.random_state_)
nitpick: this indentation does not seem to follow PEP8. I wonder why it wasn't caught by our CI.
The following should work:
imputed_values[good_sigmas] = truncated_normal.rvs(
    random_state=self.random_state_)
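For context, a hedged reconstruction of how the call above is likely assembled (the names self._min_value and self._max_value are assumptions, not quoted from the diff): scipy.stats.truncnorm expects its bounds in standard-deviation units relative to loc, so the clip limits have to be standardized first:

from scipy.stats import truncnorm

# truncnorm takes a/b as (bound - loc) / scale, i.e. in units of
# standard deviations, so the raw limits must be rescaled per sample
a = (self._min_value - mus[good_sigmas]) / sigmas[good_sigmas]
b = (self._max_value - mus[good_sigmas]) / sigmas[good_sigmas]
truncated_normal = truncnorm(a=a, b=b,
                             loc=mus[good_sigmas],
                             scale=sigmas[good_sigmas])
imputed_values[good_sigmas] = truncated_normal.rvs(
    random_state=self.random_state_)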
doc/whats_new/v0.21.rst
Outdated
@@ -48,6 +48,10 @@ Support for Python 3.4 and below has been officially dropped.
   function of other features in a round-robin fashion. :issue:`8478` by
   :user:`Sergey Feldman <sergeyf>`.

- |Enhancement| :class:`impute.IterativeImputer` now samples from a truncated normal
  distrubtion instead of a clipped normal distribution when ``sample_posterior=True``.
typo: distribution
LGTM, merging.
Reference Issues/PRs
Related to "Merge iterativeimputer branch into master" #11977
What does this implement/fix? Explain your changes.
When sampling from the posterior and boundary values are given, the current implementation clips values that are sampled from a normal distribution. This can lead to undesired oversampling of the boundary values. For example, if our boundaries were [0, 2] with a mean of 0 and a standard deviation of 1, the difference between clipping samples from a normal and sampling from a truncated normal is shown here:
[figure: comparison of clipped-normal vs truncated-normal samples, not reproduced]
When sampling from the posterior, this PR samples from a truncated normal distribution instead of clipping values that have been sampled from a normal distribution.
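To make the difference concrete, here is a small standalone sketch (not part of the PR) contrasting the two strategies on the [0, 2] example above:

import numpy as np
from scipy.stats import truncnorm

rng = np.random.RandomState(0)
a, b, mu, sigma = 0.0, 2.0, 0.0, 1.0

# clipping: every draw below 0 collapses onto the boundary itself,
# so roughly half of all samples end up exactly at 0
clipped = np.clip(rng.normal(mu, sigma, size=100000), a, b)
print((clipped == a).mean())  # ~0.5

# truncated normal: the density is renormalized over [0, 2] instead,
# so no probability mass piles up at the boundaries
trunc = truncnorm((a - mu) / sigma, (b - mu) / sigma, loc=mu, scale=sigma)
print((trunc.rvs(size=100000, random_state=rng) == a).mean())  # 0.0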
Any other comments?
To impute values within boundaries, the MICE R package appears to clip values [1], while the mi R package uses the truncated normal to sample [2] within the user-given range. Open to discussing which approach is best.

[1] page 149, https://cran.r-project.org/web/packages/mice/mice.pdf
[2] page 21, https://cran.r-project.org/web/packages/mi/mi.pdf