[MRG + 1] Warn on 1D arrays, addresses #4511 #5152
@@ -70,6 +70,7 @@ def empirical_covariance(X, assume_centered=False):
     X = np.asarray(X)
     if X.ndim == 1:
         X = np.reshape(X, (1, -1))
     if X.shape[0] == 1:
         warnings.warn("Only one sample available. "
                       "You may want to reshape your data array")

@@ -79,6 +80,8 @@ def empirical_covariance(X, assume_centered=False):
     else:
         covariance = np.cov(X.T, bias=1)
+    if covariance.ndim == 0:
+        covariance = np.array([[covariance]])
     return covariance

Review discussion on the covariance.ndim == 0 check:
- Why is this needed?
- So it is not really related to the issue, but to respect the docstring. I guess this is ok as a bugfix.
- Not behaving as the doc says causes problems because …
- That's ok.
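The ndim == 0 branch exists because NumPy squeezes the covariance of a single variable down to a 0-d array. A quick standalone check with plain NumPy (independent of scikit-learn) shows the behavior the patch guards against:

```python
import numpy as np

# Five samples of a single feature, the case empirical_covariance guards.
X = np.arange(5.0).reshape(-1, 1)

# np.cov squeezes a single-variable result down to a 0-d array...
covariance = np.cov(X.T, bias=1)
assert covariance.ndim == 0

# ...so the patch wraps it back into the documented
# (n_features, n_features) shape.
if covariance.ndim == 0:
    covariance = np.array([[covariance]])

print(covariance.shape)  # (1, 1)
```

Without the wrap, downstream code indexing covariance[i, j] would fail on single-feature input even though the docstring promises a 2-D array.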
@@ -334,7 +334,10 @@ def __init__(self, alpha=.01, mode='cd', tol=1e-4, enet_tol=1e-4,
         self.store_precision = True

     def fit(self, X, y=None):
-        X = check_array(X)
+        # Covariance does not make sense for a single feature
+        X = check_array(X, ensure_min_features=2, ensure_min_samples=2)
         if self.assume_centered:
             self.location_ = np.zeros(X.shape[1])
         else:

@@ -557,7 +560,8 @@ def fit(self, X, y=None):
         X : ndarray, shape (n_samples, n_features)
             Data from which to compute the covariance estimate
         """
-        X = check_array(X)
+        # Covariance does not make sense for a single feature
+        X = check_array(X, ensure_min_features=2)
         if self.assume_centered:
             self.location_ = np.zeros(X.shape[1])
         else:

Review discussion:
- Can you add this to the fit function you changed above, too?
- Again, why is this non-default?
- This PR also makes sure that sensible errors are raised.
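The ensure_min_samples / ensure_min_features arguments make check_array reject degenerate inputs before any estimation runs. Roughly, they behave like the following sketch (check_min_shape is a hypothetical stand-in written for illustration, not scikit-learn's implementation):

```python
import numpy as np

def check_min_shape(X, ensure_min_samples=1, ensure_min_features=1):
    # Hypothetical stand-in for the shape guards in sklearn's check_array.
    X = np.asarray(X)
    n_samples, n_features = X.shape
    if n_samples < ensure_min_samples:
        raise ValueError("Found array with %d sample(s) while a minimum of "
                         "%d is required." % (n_samples, ensure_min_samples))
    if n_features < ensure_min_features:
        raise ValueError("Found array with %d feature(s) while a minimum of "
                         "%d is required." % (n_features, ensure_min_features))
    return X

# Covariance of a single feature is just a variance, so single-feature
# input is rejected up front with a clear error.
try:
    check_min_shape(np.arange(6.0).reshape(-1, 1), ensure_min_features=2)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```

Failing early like this turns a confusing downstream crash into an error message that names the offending shape.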
@@ -55,8 +55,8 @@ def test_covariance():
         cov.error_norm(empirical_covariance(X_1d), norm='spectral'), 0)

     # test with one sample
-    # FIXME I don't know what this test does
-    X_1sample = np.arange(5)
+    # Create X with 1 sample and 5 features
+    X_1sample = np.arange(5).reshape(1, 5)
     cov = EmpiricalCovariance()
     assert_warns(UserWarning, cov.fit, X_1sample)
     assert_array_almost_equal(cov.covariance_,

Review comment:
- Ah, right. I was confused by weird reshaping.
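assert_warns in the test above checks that fitting on a single-sample X emits the UserWarning. With only the standard library, the same check looks roughly like this (fit_one_sample is a toy stand-in for illustration, not the estimator's real fit method):

```python
import warnings

def fit_one_sample(X):
    # Toy stand-in: emit the same warning empirical_covariance gives
    # when only one sample is available.
    if len(X) == 1:
        warnings.warn("Only one sample available. "
                      "You may want to reshape your data array", UserWarning)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fit_one_sample([[0, 1, 2, 3, 4]])  # one sample, five features

print(any(issubclass(w.category, UserWarning) for w in caught))  # True
```

Recording warnings rather than letting them print keeps the assertion deterministic regardless of the process-wide warning filters.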
@@ -172,8 +172,8 @@ def test_ledoit_wolf():
     assert_array_almost_equal(empirical_covariance(X_1d), lw.covariance_, 4)

     # test with one sample
-    # FIXME I don't know what this test does
-    X_1sample = np.arange(5)
+    # warning should be raised when using only 1 sample
+    X_1sample = np.arange(5).reshape(1, 5)
     lw = LedoitWolf()
     assert_warns(UserWarning, lw.fit, X_1sample)
     assert_array_almost_equal(lw.covariance_,

@@ -220,7 +220,7 @@ def test_oas():
     assert_array_almost_equal(scov.covariance_, oa.covariance_, 4)

     # test with n_features = 1
-    X_1d = X[:, 0].reshape((-1, 1))
+    X_1d = X[:, 0:1]
     oa = OAS(assume_centered=True)
     oa.fit(X_1d)
     oa_cov_from_mle, oa_shinkrage_from_mle = oas(X_1d, assume_centered=True)

@@ -259,8 +259,8 @@ def test_oas():
     assert_array_almost_equal(empirical_covariance(X_1d), oa.covariance_, 4)

     # test with one sample
-    # FIXME I don't know what this test does
-    X_1sample = np.arange(5)
+    # warning should be raised when using only 1 sample
+    X_1sample = np.arange(5).reshape(1, 5)
     oa = OAS()
     assert_warns(UserWarning, oa.fit, X_1sample)
     assert_array_almost_equal(oa.covariance_,
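The change from X[:, 0].reshape((-1, 1)) to X[:, 0:1] is behavior-preserving: slicing with 0:1 keeps the column axis, so the result is already a 2-D column and no reshape is needed. A quick check:

```python
import numpy as np

X = np.arange(12.0).reshape(4, 3)

a = X[:, 0].reshape((-1, 1))  # old spelling: drop the axis, then restore it
b = X[:, 0:1]                 # new spelling: keep the column axis directly

print(a.shape, b.shape)       # (4, 1) (4, 1)
print(np.array_equal(a, b))   # True
```

The slice form also reads as "columns 0 through 1", which states the intended (n_samples, 1) shape directly instead of reconstructing it.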
@@ -39,6 +39,7 @@
 from sklearn.tree.tree import SPARSE_SPLITTERS

 # toy sample
 X = [[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1]]
 y = [-1, -1, -1, 1, 1, 1]

@@ -724,6 +725,7 @@ def test_memory_layout():
         yield check_memory_layout, name, dtype

+@ignore_warnings
 def check_1d_input(name, X, X_2d, y):
     ForestEstimator = FOREST_ESTIMATORS[name]
     assert_raises(ValueError, ForestEstimator(random_state=0).fit, X, y)

@@ -735,8 +737,9 @@ def check_1d_input(name, X, X_2d, y):
     assert_raises(ValueError, est.predict, X)

+@ignore_warnings
 def test_1d_input():
-    X = iris.data[:, 0].ravel()
+    X = iris.data[:, 0]
     X_2d = iris.data[:, 0].reshape((-1, 1))
     y = iris.target

Review discussion on test_1d_input:
- I am not sure whether these tests should be kept.
- I think they should be kept for now with …
- This test is failing now as we are not doing any 1d to 2d conversion.
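These tests pin down the new contract: a 1-D X is ambiguous (is it one sample with many features, or many samples of one feature?), so fit now raises ValueError instead of silently reshaping, and the caller must disambiguate explicitly:

```python
import numpy as np

x = np.arange(5)        # 1-D: could mean 5 samples or 5 features

col = x.reshape(-1, 1)  # interpret as 5 samples of 1 feature
row = x.reshape(1, -1)  # interpret as 1 sample of 5 features

print(col.shape)  # (5, 1)
print(row.shape)  # (1, 5)
```

Forcing the caller to pick one of the two reshapes is exactly what the error message introduced by this PR asks for.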
Review discussion (hierarchical clustering changes):
- While I am not opposed to this change, why is this needed? How is it related to the purpose of this PR? Is this something specific to hierarchical clustering? If so, we should probably have an inline comment here to motivate this non-default value. If this is because the default value for n_clusters is 2, then I think we should have: … instead. That would be more explicit.
- The problem is that it is really hard to test that this PR doesn't break anything. I thought it was a good idea to include a test that sensible things happen with 2-D X with one sample or one feature. It turns out that many of our estimators break. This should probably be a different PR though.
- This fails with 1 cluster as well, as we found out in this test: https://github.com/vighneshbirodkar/scikit-learn/blob/array_1d_fix/sklearn/utils/estimator_checks.py#L392
- Oh yeah, and I think you are right that it is linked to the default number of clusters. @vighneshbirodkar please change that.
- @amueller This failed with clusters set to 1 as well. See my previous comment.
- Do you remember the line that failed?
- I checked clustering with n_clusters > n_samples and we already get a good error message from line 586 (in master). On master also, clustering 1 sample with 1 cluster crashes with: … I am fine with setting ensure_min_samples=2 as a stopgap in this PR. At least it's better than the state of master.