@HabchiSarra
Contributor

Reference Issues/PRs

Fixes #14340

What does this implement/fix? Explain your changes.

The shape extraction from data_qualities used NumberOfFeatures, which excluded the ignored features.
This exclusion caused a bug in the data conversion since we tried to reshape the whole dataset with a lower number of features.
This fix includes all features in the shape extraction.
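
A minimal sketch of the failure mode described above, written in plain Python rather than the NumPy-based conversion scikit-learn uses. The helper `reshape_rows` and the toy `rows` table are hypothetical; the sketch only illustrates how a feature count that omits an ignored column misaligns the rebuilt table.

```python
from itertools import chain, islice


def reshape_rows(rows, n_samples, n_features):
    """Flatten `rows`, then rebuild a table of the declared shape.

    Hypothetical helper for illustration only; not scikit-learn code.
    """
    count = n_samples * n_features
    flat = list(islice(chain.from_iterable(rows), count))
    return [flat[i * n_features:(i + 1) * n_features]
            for i in range(n_samples)]


# Three samples with four columns each; assume the first column is an
# ignored feature (for example, a row identifier).
rows = [[0, 1.0, 2.0, 3.0],
        [1, 4.0, 5.0, 6.0],
        [2, 7.0, 8.0, 9.0]]

# Feature count that excludes the ignored column: values shift across rows.
wrong = reshape_rows(rows, n_samples=3, n_features=3)

# Feature count that includes every column: the layout is preserved.
right = reshape_rows(rows, n_samples=3, n_features=4)
```

With the smaller count, the second rebuilt row starts with the last value of the first original row, so any later column indexing refers to the wrong data or runs past the end of the row.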

Member

@thomasjpfan left a comment

There is a naming convention for the data files in sklearn/datasets/tests/data/openml/62. You can learn the conventions by looking at other data in the test directory.

For example, the ARFF data is named data-v1-download-1.arff.gz, and so on.

@HabchiSarra force-pushed the fix-ignored-features-openml branch from a2505a6 to 53223a6 (August 12, 2019 17:02)
@HabchiSarra
Contributor Author

Thanks, @thomasjpfan.
I followed your recommendations and changed my commit accordingly.

@amueller
Member

tests are failing.

@HabchiSarra force-pushed the fix-ignored-features-openml branch from 53223a6 to c6fdb40 (August 12, 2019 20:01)
@HabchiSarra
Contributor Author

@amueller I fixed the tests.
However, for the failing doc build, I don't think it is related to my modifications.
Do you have an idea about its root cause or how to fix it?

@amueller
Member

amueller commented Aug 12, 2019

Thanks! It's indeed unrelated to your changes. Please merge the current master branch, which fixes this issue.

@HabchiSarra force-pushed the fix-ignored-features-openml branch from c6fdb40 to 60bc2b6 (August 12, 2019 20:50)
def _get_data_shape(data_qualities):
# Using the data_info dictionary from _get_data_info_by_name to extract
# the number of samples / features
def _get_data_instances(data_qualities):
Member

Suggested change
- def _get_data_instances(data_qualities):
+ def _get_num_samples(data_qualities):

I think this would be clearer naming for readers of scikit-learn code.

Contributor Author

I changed the name, thanks.

Parameters
----------
data_qualities : list
Member

It can't be a list. It's keyed by strings.

Member

This is indeed a list of dict. Each dict has two keys, "name" and "value".
This looks really odd, because we could have a single dict with each "name" as the key and its "value" as the associated value, but this is not the case. That mapping is actually what the dict comprehension at l.448 builds.

data_qualities : list of dict
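
To make the structure under discussion concrete, here is a small sketch of a list of dicts being collapsed into one mapping. The sample entries and their numbers are made up, and the key names `"name"` and `"value"` are an assumption based on the dict comprehension the review refers to.

```python
# Hypothetical sample of the structure discussed above: a list of dicts,
# one per data quality. The numbers are invented for illustration.
data_qualities = [
    {"name": "NumberOfInstances", "value": "101.0"},
    {"name": "NumberOfFeatures", "value": "17.0"},
]

# Collapse the list into a single mapping keyed by quality name.
qualities = {d["name"]: d["value"] for d in data_qualities}

# In this sketch the values are strings such as "101.0", which is why the
# conversion goes through float before int.
n_samples = int(float(qualities["NumberOfInstances"]))
```

Documenting the parameter as `list of dict` matches the input; the string-keyed mapping only exists after the comprehension has run.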

Contributor Author

Yes, it is a list of dict.
I updated the documentation accordingly.

return (int(float(qualities['NumberOfInstances'])),
int(float(qualities['NumberOfFeatures'])))
instances = int(float(qualities['NumberOfInstances']))
except AttributeError:
Member

I don't see how an AttributeError would be raised here.

Member

It should be a KeyError, shouldn't it?

Contributor Author

I changed the way we get the value, so it's not needed anymore.
See https://github.com/scikit-learn/scikit-learn/pull/14623/files#diff-ea672b15dd808c88257c58681d17bb6aR448
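
The linked diff is not reproduced in this thread, so the following is only a guess at what a lookup without a `try`/`except` might look like. The function name `get_num_samples`, the `-1` default, and the `"name"`/`"value"` keys are assumptions for illustration. The second half shows why the reviewers questioned `AttributeError`: subscripting a dict with a missing key raises `KeyError`.

```python
def get_num_samples(data_qualities):
    """Return the sample count, or -1 if the quality is missing.

    Hypothetical sketch; not the code merged in this pull request.
    """
    default_n_samples = -1
    qualities = {d["name"]: d["value"] for d in data_qualities}
    # dict.get never raises for a missing key, so no try/except is needed.
    return int(float(qualities.get("NumberOfInstances", default_n_samples)))


# A plain subscript on a missing key raises KeyError, not AttributeError.
try:
    {}["NumberOfInstances"]
except KeyError as exc:
    raised = type(exc).__name__
```

Using `dict.get` with a default keeps the "unavailable" case in one expression, at the cost of hiding a misspelled key name.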

Member

@glemaitre left a comment

A couple of comments.

@HabchiSarra force-pushed the fix-ignored-features-openml branch from 60bc2b6 to e6e8a1d (August 13, 2019 18:34)
@HabchiSarra
Contributor Author

I addressed the review comments. Does the new version suit you?

Member

@thomasjpfan left a comment

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

@HabchiSarra force-pushed the fix-ignored-features-openml branch from e6e8a1d to 53921ca (August 13, 2019 19:53)
Member

@glemaitre left a comment

Nitpicking; otherwise LGTM.

# Using the data_info dictionary from _get_data_info_by_name to extract
# the number of samples / features
def _get_num_samples(data_qualities):
"""Get the number of samples from data qualities
Member

missing full stop.

-------
instances : int
The number of samples in the dataset or -1 if data qualities are
unavailable
Member

missing full stop

Returns
-------
instances : int
Member

Suggested change
- instances : int
+ n_samples : int

Contributor Author

Ok, it's done.

The shape extraction from data_qualities was using NumberOfFeatures,
which excluded the ignored features.
This exclusion caused a bug in the data conversion, since we tried
to reshape the whole dataset with a lower number of features.

This fix uses data_features to include ignored features in the shape
extraction

Fixes scikit-learn#14340
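
As a rough illustration of the approach the commit message describes, the sketch below derives the column count from the full feature list instead of a quality that leaves out ignored features. The helper `get_data_shape`, the sample feature descriptions, and the `"is_ignore"` key with string values are all assumptions for illustration, not the merged code.

```python
def get_data_shape(n_samples, features_list):
    """Build the (rows, columns) shape, counting every feature.

    Hypothetical sketch; ignored features are deliberately included.
    """
    return n_samples, len(features_list)


# Made-up feature descriptions; the "is_ignore" key and its string values
# are assumptions for illustration.
features_list = [
    {"name": "animal", "is_ignore": "true"},
    {"name": "hair", "is_ignore": "false"},
    {"name": "type", "is_ignore": "false"},
]

shape = get_data_shape(101, features_list)

# Counting only the non-ignored features would give a smaller number,
# which is the mismatch the fix avoids.
n_kept = sum(f["is_ignore"] == "false" for f in features_list)
```

Ignored columns can still be dropped after the conversion; the shape only has to match what is physically present in the downloaded data.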
@HabchiSarra force-pushed the fix-ignored-features-openml branch from 53921ca to ea80272 (August 13, 2019 20:48)
@jnothman
Member

Thanks @HabchiSarra!

@jnothman merged commit e49b9d3 into scikit-learn:master on Aug 13, 2019
Development

Successfully merging this pull request may close these issues.

fetch_openml('zoo') raises IndexError in sklearn.datasets.openml._convert_arff_data

5 participants