[MRG + 1]Deprecating 1D inputs in fast_mcd #5234

Merged
merged 1 commit into from
Oct 12, 2015

Conversation

vighneshbirodkar
Contributor

Addresses #4512

I am just waiting to make sure that Travis does not throw the 1D deprecation warning

@amueller
Member

amueller commented Sep 9, 2015

Can you add a test that it does throw a deprecation warning with 1d input?

@vighneshbirodkar
Contributor Author

@amueller
Given 1D input, the function treats it as a single-sample array, and the later part of the function does not work well with only one sample. I could get it to work properly with 4 samples.

@amueller amueller changed the title [WIP]Deprecating 1D inputs in fast_mcd [MRG + 1]Deprecating 1D inputs in fast_mcd Sep 9, 2015
@amueller
Member

amueller commented Sep 9, 2015

LGTM. Now it warns that 1d isn't good, at least. Having a warning for the right number of samples is for another day.


X = np.arange(100)
try:
    assert_warns(DeprecationWarning, fast_mcd, X)
Member

You could also check that the DeprecationWarning is only due to the 1D X using assert_warns_message, if you don't mind ;)
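
A minimal sketch of the message check suggested here, using only the standard library rather than scikit-learn's own testing helpers. `fit_1d_aware` is a hypothetical stand-in for `fast_mcd`, `warning_messages` is an illustrative helper, and plain Python lists stand in for NumPy arrays.

```python
import warnings


def fit_1d_aware(X):
    """Hypothetical stand-in for fast_mcd: warns when X is a flat sequence."""
    if X and not isinstance(X[0], (list, tuple)):
        warnings.warn("Passing 1d arrays as data is deprecated",
                      DeprecationWarning)
        X = [[value] for value in X]  # treat each value as one sample
    return len(X)


def warning_messages(func, *args):
    """Call func and return the text of every DeprecationWarning it emits."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not let filters hide the warning
        func(*args)
    return [str(w.message) for w in caught
            if issubclass(w.category, DeprecationWarning)]


messages = warning_messages(fit_1d_aware, list(range(100)))
# Pass only if the warning is the 1D one, not some unrelated deprecation.
assert any("1d arrays" in message for message in messages)
```

Matching on the message text, and not only on the warning category, keeps the test from passing by accident when an unrelated deprecation is triggered.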

@vighneshbirodkar
Contributor Author

@amueller @rvraghav93
I have kept it as is because the same approach was followed in the last PR #5152

@amueller amueller added this to the 0.18 milestone Sep 22, 2015
@amueller
Member

ping @ogrisel

@@ -39,6 +40,15 @@ def test_mcd():
    # 1D data set
    launch_mcd_on_dataset(500, 1, 100, 0.001, 0.001, 350)

def test_fast_mcd():

Member

To be consistent with the style of the other functions in this file and the project in general, please remove this blank line.

@ogrisel
Member

ogrisel commented Sep 23, 2015

I don't really see the point of raising a deprecation warning before raising an exception with an obscure error message:

>>> import numpy as np
>>> from sklearn.covariance import fast_mcd
>>> fast_mcd(np.arange(100))
/volatile/ogrisel/code/scikit-learn/sklearn/utils/validation.py:372: DeprecationWarning: Passing 1d arrays as data is deprecated and will be removed in 0.18. Reshape your data either usingX.reshape(-1, 1) if your data has a single feature orX.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)
/volatile/ogrisel/code/scikit-learn/sklearn/covariance/empirical_covariance_.py:75: UserWarning: Only one sample available. You may want to reshape your data array
  warnings.warn("Only one sample available. "
Traceback (most recent call last):
  File "<ipython-input-4-6294736be2de>", line 1, in <module>
    fast_mcd(np.arange(100))
  File "/volatile/ogrisel/code/scikit-learn/sklearn/covariance/robust_covariance.py", line 487, in fast_mcd
    random_state=random_state)
  File "/volatile/ogrisel/code/scikit-learn/sklearn/covariance/robust_covariance.py", line 274, in select_candidates
    random_state=random_state))
  File "/volatile/ogrisel/code/scikit-learn/sklearn/covariance/robust_covariance.py", line 146, in _c_step
    "Singular covariance matrix. "
ValueError: Singular covariance matrix. Please check that the covariance matrix corresponding to the dataset is full rank and that MinCovDet is used with Gaussian-distributed data (or at least data drawn from a unimodal, symmetric distribution.

In my opinion it would be better to directly raise a ValueError that states that computing covariance on 1D data is invalid:

X = check_array(X, ensure_2d=False, ensure_min_samples=2)
if X.ndim == 1:
    raise ValueError('Calling fast_mcd on a 1D array is invalid.')

The ensure_min_samples=2 argument makes sure that an informative error message is also raised when 2D data is passed with a single sample (which is not valid either).
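
The two failure modes described above can be sketched without scikit-learn. This is an illustration under stated assumptions, not the library's code: `shape_of` and `validate_mcd_input` are hypothetical helpers, nested lists stand in for arrays, and the 1D check runs first for simplicity.

```python
def shape_of(X):
    """Return (n,) for a flat sequence, (n_samples, n_features) if nested."""
    if X and isinstance(X[0], (list, tuple)):
        return (len(X), len(X[0]))
    return (len(X),)


def validate_mcd_input(X, min_samples=2):
    """Hypothetical validation: reject 1D input and too few samples."""
    shape = shape_of(X)
    if len(shape) == 1:
        raise ValueError("Calling fast_mcd on a 1D array is invalid.")
    if shape[0] < min_samples:
        raise ValueError(
            "Found array with %d sample(s) (shape=%s) while a minimum "
            "of %d is required." % (shape[0], shape, min_samples))
    return shape


for candidate in ([1, 2, 3], [[1, 2, 3]], [[1, 2], [3, 4]]):
    try:
        print("accepted, shape", validate_mcd_input(candidate))
    except ValueError as exc:
        print("rejected:", exc)
```

Both invalid inputs fail immediately with a message that names the problem, instead of reaching the singular-covariance error deep inside the algorithm.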

WDYT @amueller?

@vighneshbirodkar
Contributor Author

@ogrisel There are some other places right now where a 1 sample or 1 feature input doesn't make sense. Would all of them eventually be fixed this way? Why not change check_array to accept an optional message that will be thrown when the input has too few features or samples?

@ogrisel
Member

ogrisel commented Oct 1, 2015

Why not change check_array to accept an optional message that will be thrown when the input has too few features or samples?

I improved the error messages in #5334 by adding the estimator name when provided:

>>> from sklearn.cluster import AgglomerativeClustering
>>> AgglomerativeClustering().fit([[1, 0, -1]])
Traceback (most recent call last):
  File "<ipython-input-7-2d36cbf780b4>", line 1, in <module>
    AgglomerativeClustering().fit([[1, 0, -1]])
  File "/volatile/ogrisel/code/scikit-learn/sklearn/cluster/hierarchical.py", line 716, in fit
    X = check_array(X, ensure_min_samples=2, estimator=self)
  File "/volatile/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 403, in check_array
    context))
ValueError: Found array with 1 sample(s) (shape=(1, 3)) while a minimum of 2 is required by AgglomerativeClustering.
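
As a rough illustration of how an estimator name can be folded into such a message, here is a standalone sketch. `estimator_label` and `min_samples_message` are hypothetical helpers, not scikit-learn functions, and the `AgglomerativeClustering` class below is an empty placeholder used only for its name.

```python
def estimator_label(estimator):
    """Return a display name: strings as given, objects by class name."""
    if estimator is None:
        return ""
    if isinstance(estimator, str):
        return estimator
    return type(estimator).__name__


def min_samples_message(shape, min_samples, estimator=None):
    """Build the error text, appending the estimator name when known."""
    name = estimator_label(estimator)
    context = " by %s" % name if name else ""
    return ("Found array with %d sample(s) (shape=%s) while a minimum of %d "
            "is required%s." % (shape[0], shape, min_samples, context))


class AgglomerativeClustering:
    """Empty placeholder, not the scikit-learn class."""


print(min_samples_message((1, 3), 2, AgglomerativeClustering()))
print(min_samples_message((1, 3), 2))
```

Accepting either a string or an object lets plain functions such as `fast_mcd` pass their own name while classes pass `self`.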

@ogrisel
Member

ogrisel commented Oct 1, 2015

@ogrisel There are some other places right now where a 1 sample or 1 feature input doesn't make sense.

For estimators where 1 feature does not make sense, I think the current state, as improved in #5334, is fine.

For the case where ensure_min_samples=2 (or more) and ensure_2d=True I think we could raise a better ValueError directly in check_array as you suggest (using the estimator name in that message).

@vighneshbirodkar
Contributor Author

@ogrisel If I am not wrong, the error will be raised 2 versions later, right? Is there anything more you would expect here?

@ogrisel
Member

ogrisel commented Oct 2, 2015

@ogrisel If I am not wrong, the error will be raised 2 versions later, right?

In 2 versions, 1dim input will be rejected with a stronger, more generic error message like: "estimator expect 2 dimensional array-like as input, got 1 instead".

@ogrisel
Member

ogrisel commented Oct 2, 2015

Is there anything more you would expect here?

Can you please rebase this on top of the current master and change check_array to deal specifically with the case ensure_min_samples >= 2 and ensure_2d and X.ndim == 1 to raise a specific (temporary) ValueError with an error message stating that "estimator expects at least 2 samples provided in a 2 dimensional array-like input".
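
The requested special case might look roughly like the following. This is only a sketch of the branching logic, not the actual `check_array` implementation: `check_1d_case`, its `estimator_name` parameter, and its string return values are hypothetical, and the array is reduced to its number of dimensions.

```python
import warnings


def check_1d_case(ndim, ensure_2d=True, ensure_min_samples=1,
                  estimator_name="Estimator"):
    """Hypothetical sketch of the 1D branch; returns what happened."""
    if ensure_2d and ndim == 1:
        if ensure_min_samples >= 2:
            # 1D input can never satisfy a 2-sample minimum: fail clearly.
            raise ValueError(
                "%s expects at least 2 samples provided in a 2 dimensional "
                "array-like input" % estimator_name)
        # Otherwise stay on the deprecation path: warn and continue.
        warnings.warn("Passing 1d arrays as data is deprecated",
                      DeprecationWarning)
        return "warned"
    return "ok"


try:
    check_1d_case(1, ensure_min_samples=2, estimator_name="fast_mcd")
except ValueError as exc:
    print(exc)
```

Callers that do not require at least 2 samples keep the deprecation warning, so only code that was already failing on 1D input sees the new exception.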

@ogrisel ogrisel modified the milestones: 0.17, 0.18 Oct 2, 2015
@vighneshbirodkar
Contributor Author

@ogrisel I have rebased, but I am not sure I understand what you mean. If I modify check_array to handle the specific case that you mentioned, won't it break other people's existing code? And what exactly do you mean by temporary?

@ogrisel
Member

ogrisel commented Oct 5, 2015

@ogrisel I have rebased, but I am not sure I understand what you mean. If I modify check_array to handle the specific case that you mentioned, won't it break other people's existing code?

This will impact only code that uses check_array with ensure_min_samples >= 2. Such code already always raises an exception on 1d input, albeit with an obscure error message, because the array elements are interpreted as features of a single sample, so the ensure_min_samples >= 2 condition does not hold.
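
To make the single-sample interpretation concrete, here is a small sketch with plain lists standing in for arrays. `as_2d_legacy` is a hypothetical helper that mimics only the behaviour described in this thread, where a flat sequence becomes one row.

```python
def as_2d_legacy(X):
    """Mimic the described behaviour: a flat sequence becomes one sample."""
    if X and isinstance(X[0], (list, tuple)):
        return [list(row) for row in X]
    return [list(X)]


rows = as_2d_legacy(list(range(100)))
n_samples, n_features = len(rows), len(rows[0])

# 100 values become 1 sample with 100 features, so a 2-sample minimum
# can never be met by 1D input.
assert (n_samples, n_features) == (1, 100)
assert not n_samples >= 2
```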

@ogrisel
Member

ogrisel commented Oct 5, 2015

And what exactly do you mean by temporary?

The deprecation period spans from 0.17 to 0.19. Starting in 0.19 we will always raise a ValueError on 1d input, whatever the value of ensure_min_samples.

@vighneshbirodkar
Contributor Author

@ogrisel Thank you, done

@@ -374,6 +374,9 @@ def check_array(array, accept_sparse=None, dtype="numeric", order=None,

    if ensure_2d:
        if array.ndim == 1:
            if ensure_min_samples >= 2:
                raise ValueError("%s expects at least 2 samples provided "
                "in a 2 dimensional array-like input")
Member

The indentation level does not follow PEP8, use a linter such as https://pypi.python.org/pypi/pep8 or better: https://pypi.python.org/pypi/flake8 to spot such issues.

@vighneshbirodkar
Contributor Author

@ogrisel All done, thank you for your patience.

@@ -40,6 +42,17 @@ def test_mcd():
    launch_mcd_on_dataset(500, 1, 100, 0.001, 0.001, 350)


def test_fast_mcd_on_invalid_input():
    X = np.arange(100)
    assert_raise_message(ValueError, 'expects at least 2 samples', fast_mcd, X)
Member

Please check for the presence of the class name in the message: 'fast_mcd expects at least 2 samples'.

def test_mcd_class_on_invalid_input():
    X = np.arange(100)
    mcd = MinCovDet()
    assert_raise_message(ValueError, 'expects at least 2 samples', mcd.fit, X)
Member

Please check for the presence of the class name in the message: 'MinCovDet expects at least 2 samples'.
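
A test along the lines requested here could be sketched as follows, again with the standard library only. `raises_with_message` is a hypothetical substring-matching helper, and `MinCovDetStub` is an illustrative stand-in that models only the 1D check, so its name (not `MinCovDet`) is what appears in the message.

```python
def raises_with_message(exc_type, fragment, func, *args):
    """Return True if func(*args) raises exc_type containing fragment."""
    try:
        func(*args)
    except exc_type as exc:
        return fragment in str(exc)
    return False


class MinCovDetStub:
    """Hypothetical stand-in for MinCovDet; only the 1D check is modeled."""

    def fit(self, X):
        if X and not isinstance(X[0], (list, tuple)):
            raise ValueError("%s expects at least 2 samples provided in a 2 "
                             "dimensional array-like input"
                             % type(self).__name__)
        return self


X = list(range(100))
# Including the class name makes the test fail if the name is left out
# of the message or the placeholder is never filled in.
assert raises_with_message(
    ValueError, "MinCovDetStub expects at least 2 samples",
    MinCovDetStub().fit, X)
```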

@vighneshbirodkar
Contributor Author

@ogrisel All done

@amueller
Member

amueller commented Oct 9, 2015

@vighneshbirodkar can you squash all your changes please?
@ogrisel mrg?

@amueller
Member

amueller commented Oct 9, 2015

Also: I agree, raising a value error with a good message is better.

Added fast_mcd 1D test

fixed syntax error cause py34 was failing

raise value error in check_array when min_samples>=2

formatting

added check_array to MCD class and a test for the class

pep8 formatting

fixed string formatting and modified tests
@vighneshbirodkar
Contributor Author

@amueller Done

@amueller
Member

amueller commented Oct 9, 2015

thanks :)

@ogrisel
Member

ogrisel commented Oct 12, 2015

Thank you very much for your patience @vighneshbirodkar ! Merging now!

ogrisel added a commit that referenced this pull request Oct 12, 2015
[MRG + 1]Deprecating 1D inputs in fast_mcd
@ogrisel ogrisel merged commit 8d273a1 into scikit-learn:master Oct 12, 2015
4 participants