[MRG] Raise exception on providing complex data to estimators #9551

Merged
merged 16 commits into from
Sep 1, 2017

Conversation

pravarmahajan
Contributor

Reference Issue

Fixes #9528

What does this implement/fix? Explain your changes.

Estimators should raise an exception when given complex data as input. This change implements that check
in the function check_array and adds a corresponding test function to test_validation.py.

Any other comments?

jrbourbeau and others added 4 commits August 13, 2017 15:26
- Fixes rendering of docstring examples
- Instead of importing cross_val_score in example, cross_validate is imported
…into complex_data

merging changes from the master branch
    if not hasattr(dtype_orig, 'kind'):
        # not a data type (e.g. a column named dtype in a pandas DataFrame)
        dtype_orig = None

    if dtype_orig is not None and dtype_orig.kind == "c":
        raise ValueError("Complex data is not supported\n{}\n".format(array))
    elif isinstance(array, list) or isinstance(array, tuple):
Member

I'm not sure this is the best way. We might also not catch everything with it (how about dataframes that contain complex data?). I'm not sure this is easy to catch later if dtype is not None, though. With your addition we're converting into an array twice: here and in 411. It would be great if we could use the conversion at 411, but as I said, it's not entirely clear to me how to do that.

@amueller
Member

I just realized that we are not doing anything with complex data, as we pass anything that's not object through as numeric.
I would have expected we convert it to float64, which would give a

ComplexWarning: Casting complex values to real discards the imaginary part
warning.

Looking into the numeric part, things don't look great overall:

check_array([['a', 'b']], dtype="numeric")
passes the check?!

I guess that's in accordance with the documentation, though.
I think my preferred solution would be to only catch complex data when dtype is not None, and catch the ComplexWarning in the cast and raise an error.

If dtype was numeric, we should check after the array creation, as we do for "O" data in 425.

Can you please also add a test to check_estimators.py?
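
The approach preferred in this comment could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code that was merged: `to_validated_array` is a hypothetical helper, it ignores the `order` and `copy` arguments, and the cast branch is exercised only with an ndarray input, since the thread does not establish how NumPy reports a complex-to-real cast for nested Python lists.

```python
import warnings

import numpy as np

try:  # NumPy >= 1.25
    from numpy.exceptions import ComplexWarning
except ImportError:  # older NumPy releases
    from numpy import ComplexWarning


def to_validated_array(array, dtype=None):
    """Hypothetical helper; not scikit-learn's check_array."""
    with warnings.catch_warnings():
        # Promote the lossy complex -> real cast warning to an exception.
        warnings.simplefilter("error", ComplexWarning)
        try:
            converted = np.array(array, dtype=dtype)
        except ComplexWarning:
            raise ValueError("Complex data not supported\n{}\n".format(array))
    # With dtype=None nothing is cast and no warning fires,
    # so the result has to be inspected after the array is created.
    if converted.dtype.kind == "c":
        raise ValueError("Complex data not supported\n{}\n".format(array))
    return converted
```

With `dtype=np.float64` and a complex ndarray, the error comes from the promoted warning; with `dtype=None`, it comes from the check after conversion.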

@pravarmahajan
Contributor Author

Thank you @amueller . Let me make changes as you have suggested.

@pravarmahajan
Contributor Author

pravarmahajan commented Aug 15, 2017

@amueller
I am working on the changes you suggested. A couple of questions came up as I was trying to understand the code:
(1) When the dtype argument is a list or a tuple, we either set dtype to None or take the first element of the list. Why do we allow dtype to be a list at all when everything other than the first element is discarded?
(2) In lines 425-426, array is cast to np.float64 when the dtype of the array is 'O'. I believe this case is already handled by lines 382-387 and then 411 (all line numbers refer to the pull request I created above). Is line 425 therefore needed?
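
On question (1), the `check_array` docstring states that when dtype is a list of types, conversion to the first type is performed only if the dtype of the input is not in the list. The sketch below illustrates that rule; `resolve_dtype` is a hypothetical name used for illustration, not a scikit-learn function.

```python
import numpy as np


def resolve_dtype(dtype, dtype_orig):
    """Illustrative only: choose the dtype to convert to, or None for no cast."""
    if isinstance(dtype, (list, tuple)):
        if dtype_orig is not None and dtype_orig in dtype:
            return None   # input dtype is already accepted: no conversion
        return dtype[0]   # otherwise convert to the first listed dtype
    return dtype


accepted = [np.float64, np.float32]
print(resolve_dtype(accepted, np.dtype("float32")))  # already accepted
print(resolve_dtype(accepted, np.dtype("int64")))    # needs conversion
```

Read this way, the remaining list entries are not discarded: they decide whether a conversion happens at all.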

@pravarmahajan
Contributor Author

pravarmahajan commented Aug 15, 2017

Can you please also add a test to check_estimators.py?

Did you mean test_estimator_checks.py in sklearn/utils/tests?

@jnothman
Member

jnothman commented Aug 15, 2017 via email

@jnothman
Member

jnothman commented Aug 15, 2017 via email

@pravarmahajan
Contributor Author

@jnothman
Got it, thank you!

@amueller
Member

Answering 2):
Lines 382-387 only work if an array was passed. If a list was passed, we can't know what the dtype is until after we converted to an array.
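
The point about lists can be seen with plain NumPy. A small sketch, with illustrative variable names: an ndarray exposes its dtype before any conversion, whereas a list only gets one once it has been turned into an array.

```python
import numpy as np

as_array = np.array([[1 + 2j, 3 + 4j]])
as_list = [[1 + 2j, 3 + 4j]]

# Before conversion: only objects that expose a dtype can be inspected.
early_array = getattr(as_array, "dtype", None)
early_list = getattr(as_list, "dtype", None)

# After conversion: the dtype of the list's contents becomes visible.
late_list = np.array(as_list).dtype

print(early_array.kind, early_list, late_list.kind)
```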

@pravarmahajan
Contributor Author

Thanks @amueller
I have made changes as you suggested and added a test to estimator_checks.py

@@ -433,6 +433,44 @@ def test_check_array_min_samples_and_features_messages():
    assert_array_equal(y, y_checked)


def test_check_array_complex_data_error():
    X = np.array([[1 + 2j, 3 + 4j, 5 + 7j], [2 + 3j, 4 + 5j, 6 + 7j]])
Member

Maybe, instead of repeating the content every time, create a list of lists at the beginning and build everything else from it, so it's more obvious what is tested? I don't have a strong opinion, though.

Contributor Author

I had thought of that, but it becomes very messy for some of the cases, for example converting a list of lists to a tuple of tuples or to a tuple of NumPy arrays. Maybe I can add a one-line comment for each case being tested.
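
For comparison, here is one way the reviewer's suggestion might look, with every input derived from a single list of lists. This is only a sketch: `reject_complex` is a hypothetical stand-in for `check_array`, and the set of container types is an assumption based on the cases named in this exchange.

```python
import numpy as np


def reject_complex(array):
    """Hypothetical stand-in for check_array: refuse complex input."""
    converted = np.asarray(array)
    if converted.dtype.kind == "c":
        raise ValueError("Complex data not supported\n{}\n".format(array))
    return converted


base = [[1 + 2j, 3 + 4j, 5 + 7j], [2 + 3j, 4 + 5j, 6 + 7j]]

# Each case is built from `base`, so the tested values are stated once.
cases = {
    "ndarray": np.array(base),
    "list of lists": base,
    "tuple of tuples": tuple(tuple(row) for row in base),
    "list of arrays": [np.array(row) for row in base],
    "tuple of arrays": tuple(np.array(row) for row in base),
}

results = {}
for name, X in cases.items():
    try:
        reject_complex(X)
        results[name] = False
    except ValueError:
        results[name] = True  # the expected outcome for complex input

print(results)
```

The dictionary keys play the role of the one-line comments proposed above.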

        warnings.simplefilter('error', ComplexWarning)
        array = np.array(array, dtype=dtype, order=order, copy=copy)
    except ComplexWarning:
        raise ValueError("Complex data not supported\n"
Member

I see that this branch is covered, but I don't see where. We're never passing dtype right?

Contributor Author

Never passing dtype to check_array? Even if we don't pass it to check_array, the value of dtype changes within this function; see, for example, lines 381-386. The ComplexWarning is raised only if dtype is set to one of the real types. If dtype is None, no conversion takes place and no warning is produced, so we need to check that case again in the subsequent lines.

Contributor Author

@pravarmahajan pravarmahajan Aug 17, 2017

One example where dtype is explicitly passed to the function is the SVM decision function.
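
The distinction drawn in this exchange, between an explicit real dtype and dtype=None, can be checked directly with NumPy. A minimal sketch, assuming an ndarray input; `convert` is an illustrative helper, not part of scikit-learn.

```python
import warnings

import numpy as np

try:  # NumPy >= 1.25
    from numpy.exceptions import ComplexWarning
except ImportError:  # older NumPy releases
    from numpy import ComplexWarning

X = np.array([[1 + 2j, 3 + 4j]])


def convert(dtype):
    """Return (resulting dtype kind, whether a ComplexWarning was emitted)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning
        out = np.array(X, dtype=dtype)
    warned = any(issubclass(w.category, ComplexWarning) for w in caught)
    return out.dtype.kind, warned


explicit_real = convert(np.float64)  # cast requested: imaginary part dropped
no_dtype = convert(None)             # no cast: complex data passes through
print(explicit_real, no_dtype)
```

Only the first call warns, which is why a separate check is needed when no conversion is requested.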

@amueller
Member

LGTM, I think unfortunately this is the easiest we can get away with. I'm a bit confused about when we hit the ComplexWarning branch, though.

        array = np.array(array, dtype=dtype, order=order, copy=copy)
    except ComplexWarning:
        raise ValueError("Complex data not supported\n"
                         "{}\n".format(array))
Member

Unfortunately, at this point the array has been converted to float or int and the error message will be confusing. We need to keep a reference to the original input with complex values to report it in this error message (and del original_array otherwise, to let the garbage collector free the memory as soon as possible).

Member

Or maybe:

        with warnings.catch_warnings():
            try:
                warnings.simplefilter('error', ComplexWarning)
                new_array = np.array(array, dtype=dtype, order=order, copy=copy)
            except ComplexWarning:
                raise ValueError("Complex data not supported\n"
                                 "{}\n".format(array))
            array = new_array
            del new_array

Member

Hum sorry my previous comments are wrong. Your code is fine.
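
A plausible reading of why the original code needed no change: when the right-hand side of an assignment raises, the name on the left is never rebound, so `array` still refers to the original complex input when the error message is formatted. The sketch below checks this with an ndarray input; the variable names are illustrative.

```python
import warnings

import numpy as np

try:  # NumPy >= 1.25
    from numpy.exceptions import ComplexWarning
except ImportError:  # older NumPy releases
    from numpy import ComplexWarning

original = np.array([[1 + 2j, 3 + 4j]])
array = original
message = None

with warnings.catch_warnings():
    warnings.simplefilter("error", ComplexWarning)
    try:
        # The cast raises before the assignment happens, so `array`
        # is never rebound to a real-valued result.
        array = np.array(array, dtype=np.float64)
    except ComplexWarning:
        message = "Complex data not supported\n{}\n".format(array)

print(array is original)
print(message)
```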

Member

@ogrisel ogrisel left a comment

+1 once my comment is addressed

@ogrisel ogrisel merged commit ecc96be into scikit-learn:master Sep 1, 2017
massich pushed a commit to massich/scikit-learn that referenced this pull request Sep 15, 2017
…-learn#9551)

* Modifies model_selection.cross_validate docstring (scikit-learn#9534)

- Fixes rendering of docstring examples
- Instead of importing cross_val_score in example, cross_validate is imported

* raise error on complex data input to estimators

* Raise exception on providing complex data to estimators

* adding checks to check_estimator for complex data

* removing some unnecessary parts

* autopep8 changes

* removing ipdb, restoring some autopep8 fixes

* removing ipdb, restoring some autopep8 fixes

* adding documentation for complex data handling

* adding one line explanation for each test case
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
…-learn#9551)

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
…-learn#9551)

@lesteve
Member

lesteve commented Feb 23, 2018

Stupid question, why did we not use array.dtype.kind == 'c' to detect complex arrays? This seems simpler than catching a warning ...

@pravarmahajan
Contributor Author

@lesteve
Are you talking about this part?

with warnings.catch_warnings():
    try:
        warnings.simplefilter('error', ComplexWarning)
        new_array = np.array(array, dtype=dtype, order=order, copy=copy)
    except ComplexWarning:
        raise ValueError("Complex data not supported\n"
                         "{}\n".format(array))
    array = new_array
    del new_array

The array here is not necessarily an ndarray object; it could also be a list of lists. In the latter case array.dtype.kind == 'c' won't work. So it's easier to try converting the array to an ndarray, check for any ComplexWarning raised, and re-raise it as an exception.

@lesteve
Member

lesteve commented Feb 26, 2018

OK fair enough ... I feel like this could be simplified, but it will require a bit more thought.
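
One direction such a simplification might take, offered only as an assumption and not as something proposed in this thread, is NumPy's `np.iscomplexobj`. It inspects the dtype when the input has one and otherwise converts the input to an array first, so it covers nested lists as well as ndarrays.

```python
import numpy as np

from_ndarray = np.iscomplexobj(np.array([[1 + 2j, 3 + 4j]]))
from_list = np.iscomplexobj([[1 + 2j, 3]])
from_real_list = np.iscomplexobj([[1.0, 3.0]])

print(from_ndarray, from_list, from_real_list)
```

For list input this still performs a conversion of its own, which is the double-conversion cost raised earlier in the review.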

Successfully merging this pull request may close these issues.

Estimators run with complex data and give wrong results
6 participants