[MRG] sample from a truncated normal instead of clipping samples from a normal #12177
Conversation
This looks nice :) I assume there's no simple way to write a non-regression test :\ ?
Some tests fail with older versions of scipy. I think we might need a backport in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/fixes.py
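A minimal sketch of what such a version-gated shim in fixes.py might look like (the vendored module name _truncnorm_backport and the exact version check are assumptions, not sklearn's actual code):

# hypothetical shim: prefer scipy's truncnorm, fall back to a vendored copy
import scipy
from distutils.version import LooseVersion

if LooseVersion(scipy.__version__) < LooseVersion('0.17'):
    from ._truncnorm_backport import truncnorm  # hypothetical vendored module
else:
    from scipy.stats import truncnorm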
LGTM.com bot: This pull request introduces 2 alerts when merging 21d7238 into a4f2a89.
LGTM.com bot: This pull request introduces 4 alerts when merging 451b80d into a4f2a89.
LGTM.com bot: This pull request introduces 5 alerts when merging f5b5d94 into a4f2a89.
LGTM.com bot: This pull request introduces 3 alerts when merging 97e2d37 into a4f2a89.
…andom_state' parameter in freeze
LGTM.com bot: This pull request introduces 3 alerts when merging bf0e49f into a4f2a89.
@@ -591,6 +592,28 @@ def test_iterative_imputer_clip():
    assert_allclose(Xt[X != 0], X[X != 0])


def test_iterative_imputer_normal_posterior():
    rng = np.random.RandomState(0)
Add a comment here on what, intuitively, this test ensures
sklearn/tests/test_impute.py
Outdated
# generate multiple imputations for the single missing value
imputations = np.array([imputer.transform(X)[0][0] for _ in range(1000)])
mu, sigma = imputations.mean(), imputations.std()
ks_statistic, p_value = kstest((imputations-mu)/sigma, 'norm')
Please put spaces around binary operators
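Applied, the flagged line would read:

ks_statistic, p_value = kstest((imputations - mu) / sigma, 'norm')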
LGTM.com bot: This pull request introduces 3 alerts when merging 2edb84c into a4f2a89.
for the backport, I just took the relevant scipy code and added it to a new file
LGTM.com bot: This pull request introduces 3 alerts when merging 00a4dae into a4f2a89.
Actually, based on the consensus emerging from the discussion in #12184, I don't think we need to backport: scikit-learn 0.21 will require scipy >= 0.17.0. So please remove the backport and instead skip the tests that fail because of the lack of scipy.stats.truncnorm:

# TODO remove the skipif marker as soon as scipy < 0.17 support is dropped
@pytest.mark.skipif(not hasattr(scipy.stats, 'truncnorm'),
                    reason='need scipy.stats.truncnorm')
def test_something():
    ...
…and remove non-ascii character in description for python2 support
cool, sounds good. I used the following to skip tests, as scipy 0.14 has truncnorm but it doesn't accept random_state:

def test_something():
    pytest.importorskip("scipy", minversion="0.17.0")
    ...
Please add your name to the credits for IterativeImputer in doc/whats_new/v0.21.rst
doc/whats_new/v0.21.rst
Outdated
Multiple modules
................


Changes to estimator checks
---------------------------


These changes mostly affect library developers.
These changes mo:tabstly affect library developers.
?
whoops, sorry about that!
Here are some comments.
The new test is nice, but it doesn't really check that truncated normal != clipped normal, right? But I guess this is fine. I am not sure what we could do to test better.
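One possibility, sketched here rather than taken from the PR (the 1% tolerance is an assumption): clipping piles probability mass exactly on the boundaries, so a sharper test could assert that repeated posterior draws almost never coincide with min_value or max_value:

# hypothetical sharper check: truncated-normal draws should almost never
# land exactly on the boundaries, whereas clipped draws often do
imputations = np.array([imputer.transform(X)[0][0] for _ in range(1000)])
assert (imputations == imputer.min_value).mean() < 0.01
assert (imputations == imputer.max_value).mean() < 0.01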
def test_iterative_imputer_normal_posterior():
    # test that the values that are imputed using `sample_posterior=True`
    # with boundaries (`min_value` and `max_value` are not None) are drawn
    # from a distribution that looks gaussian via the Kolmogorov Smirnov test
"that does not look Gaussian", right?
If the KS test p-value is larger than 0.1 the null hypothesis that the posterior samples are sampled from a standard normal distribution is rejected.
And this is expected because of the truncation by min_value
and max_value
that cut the tails of the posterior distribution.
Perhaps it should read "that looks roughly Gaussian."
We would want to fail to reject the null hypothesis. We don't want the truncation to make the posterior no longer roughly Gaussian-looking.
Oh alright sorry, it's just me having trouble with double negation reasoning.
# we want to fail to reject null hypothesis
# null hypothesis: distributions are the same
assert ks_statistic < 0.15 or p_value > 0.1, \
    "The posterior does not appear to be normal"
Where does the 0.15 threshold for the KS-statistic come from? Isn't p_value > 0.1 enough?
I made this assert statement based on my understanding of this line from the scipy.stats.ks_2samp docs: "If the K-S statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same."
Empirically, a slightly higher threshold might be needed (0.2). The p-values don't seem to differ much between the two branches (this branch/iterativeimputer branch). The threshold change is based on a mistake I made generating the test data: the values of variable X were supposed to be normally distributed, centered at 0, but I mistakenly used .random_sample, which is uniformly distributed on [0, 1), so it was unlikely min_value=0 would do much.
Code used:
Experiment:
from tqdm import tqdm
import joblib
import numpy as np
from scipy.stats import kstest
from sklearn.impute import IterativeImputer

stats, ps = [], []
for seed in tqdm(range(0, 100)):
    # test that the values that are imputed using `sample_posterior=True`
    # with boundaries (`min_value` and `max_value` are not None) are drawn
    # from a distribution that looks gaussian via the Kolmogorov Smirnov test
    rng = np.random.RandomState(seed)
    X = rng.normal(size=(5, 5))
    X[0][0] = np.nan
    imputer = IterativeImputer(min_value=0,
                               max_value=0.5,
                               sample_posterior=True,
                               random_state=rng)
    imputer.fit_transform(X)
    # generate multiple imputations for the single missing value
    imputations = np.array([imputer.transform(X)[0][0] for _ in range(1000)])
    mu, sigma = imputations.mean(), imputations.std()
    if sigma == 0:
        sigma += 1e-12
    ks_statistic, p_value = kstest((imputations - mu) / sigma, 'norm')
    stats.append(ks_statistic)
    ps.append(p_value)
joblib.dump((stats, ps), 'master_branch.joblib')
Plotting:
import joblib
import numpy as np
import matplotlib.pyplot as plt
clip_stats, clip_ps = joblib.load("./master_branch.joblib")
truc_stats, truc_ps = joblib.load("./new_branch.joblib")
clip_ps = np.array(clip_ps)[~np.isnan(clip_ps)]
clip_stats = np.array(clip_stats)[~np.isnan(clip_stats)]
truc_ps = np.array(truc_ps)[~np.isnan(truc_ps)]
truc_stats = np.array(truc_stats)[~np.isnan(truc_stats)]
plt.boxplot([truc_stats, clip_stats], labels=["Truncated normal", "Clipped normal"])
plt.hlines(0.15, 0, 3, linestyles='dotted', colors='r')
plt.title("Distribution of KS stats")
plt.savefig("ksstats.png")
plt.close()
plt.boxplot([truc_ps, clip_ps], labels=["Truncated normal", "Clipped normal"])
plt.title("Distribution of p-values")
plt.savefig("pvalues.png")
Thank you very much for the details, this is very helpful.
sklearn/tests/test_impute.py
Outdated
@@ -591,6 +592,32 @@ def test_iterative_imputer_clip():
    assert_allclose(Xt[X != 0], X[X != 0])

def test_iterative_imputer_normal_posterior(): |
I would rename this to test_iterative_imputer_truncated_normal_posterior
sklearn/impute.py
Outdated
loc=mus[good_sigmas],
scale=sigmas[good_sigmas])
imputed_values[good_sigmas] = truncated_normal.rvs(
random_state=self.random_state_)
nitpick: this indentation does not seem to follow PEP8. I wonder why it wasn't caught by our CI.
The following should work:
imputed_values[good_sigmas] = truncated_normal.rvs(
    random_state=self.random_state_)
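For context, a hedged reconstruction of how the call above is likely assembled (the names self._min_value and self._max_value are assumptions, not quoted from the diff): scipy.stats.truncnorm expects its bounds in standard-deviation units relative to loc, so the clip limits have to be standardized first:

from scipy.stats import truncnorm

# truncnorm takes a/b as (bound - loc) / scale, i.e. in units of
# standard deviations, so the raw limits must be rescaled per sample
a = (self._min_value - mus[good_sigmas]) / sigmas[good_sigmas]
b = (self._max_value - mus[good_sigmas]) / sigmas[good_sigmas]
truncated_normal = truncnorm(a=a, b=b,
                             loc=mus[good_sigmas],
                             scale=sigmas[good_sigmas])
imputed_values[good_sigmas] = truncated_normal.rvs(
    random_state=self.random_state_)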
doc/whats_new/v0.21.rst
Outdated
@@ -48,6 +48,10 @@ Support for Python 3.4 and below has been officially dropped.
   function of other features in a round-robin fashion. :issue:`8478` by
   :user:`Sergey Feldman <sergeyf>`.

- |Enhancement| :class:`impute.IterativeImputer` now samples from a truncated normal
  distrubtion instead of a clipped normal distribution when ``sample_posterior=True``.
typo: distribution
LGTM, merging.
Reference Issues/PRs
Related to "Merge iterativeimputer branch into master" #11977
What does this implement/fix? Explain your changes.
When sampling from the posterior and boundary values are given, the current implementation clips values that are sampled from a normal distribution. This can lead to undesired oversampling of the boundary values. For example, if our boundaries were [0, 2] with a mean of 0 and a standard deviation of 1, the difference between clipping samples from a normal and sampling from a truncated normal is shown here:
[figure: comparison of clipped-normal vs truncated-normal samples, not reproduced]
When sampling from the posterior, this PR samples from a truncated normal distribution instead of clipping values that have been sampled from a normal distribution.
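To make the difference concrete, here is a small standalone sketch (not part of the PR) contrasting the two strategies on the [0, 2] example above:

import numpy as np
from scipy.stats import truncnorm

rng = np.random.RandomState(0)
a, b, mu, sigma = 0.0, 2.0, 0.0, 1.0

# clipping: every draw below 0 collapses onto the boundary itself,
# so roughly half of all samples end up exactly at 0
clipped = np.clip(rng.normal(mu, sigma, size=100000), a, b)
print((clipped == a).mean())  # ~0.5

# truncated normal: the density is renormalized over [0, 2] instead,
# so no probability mass piles up at the boundaries
trunc = truncnorm((a - mu) / sigma, (b - mu) / sigma, loc=mu, scale=sigma)
print((trunc.rvs(size=100000, random_state=rng) == a).mean())  # 0.0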
Any other comments?
To impute values within boundaries, the MICE R package appears to clip values [1], while the mi R package uses the truncated normal to sample [2] within the user-given range. Open to discussing which approach is best.

[1] page 149, https://cran.r-project.org/web/packages/mice/mice.pdf
[2] page 21, https://cran.r-project.org/web/packages/mi/mi.pdf