FIX IndexError in fetch_openml('zoo') #14623

HabchiSarra · 2019-08-10T19:46:12Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

The shape extraction from data_qualities used NumberOfFeatures, which excluded the ignored features.
This exclusion caused a bug in the data conversion since we tried to reshape the whole dataset with a lower number of features.
This fix includes all features in the shape extraction.
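
To illustrate the failure mode described above, here is a minimal sketch in plain Python. It is not the scikit-learn code: `reshape_rows` and the toy numbers are hypothetical. It only shows that a flat buffer holding every column cannot be split into rows when the column count leaves out the ignored features.

```python
def reshape_rows(flat_values, n_samples, n_features):
    """Split a flat list into n_samples rows of n_features values each.

    Hypothetical helper for illustration only; not part of scikit-learn.
    """
    if len(flat_values) != n_samples * n_features:
        raise ValueError(
            "cannot reshape %d values into shape (%d, %d)"
            % (len(flat_values), n_samples, n_features)
        )
    return [
        flat_values[i * n_features:(i + 1) * n_features]
        for i in range(n_samples)
    ]


# Toy dataset: 3 samples, 4 columns in the file, 1 of them marked as ignored.
n_samples = 3
n_all_columns = 4
n_reported_features = n_all_columns - 1  # count that omits the ignored column

# The downloaded data still contains every column.
flat = list(range(n_samples * n_all_columns))

try:
    reshape_rows(flat, n_samples, n_reported_features)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# Using the full column count succeeds.
rows = reshape_rows(flat, n_samples, n_all_columns)
```

In this sketch the mismatch surfaces as a `ValueError`; the actual error type in scikit-learn depends on the conversion code path.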

sklearn/datasets/openml.py

thomasjpfan

There is a naming convention for the data in sklearn/datasets/tests/data/openml/62. You can learn the conventions by looking at other data in the test directory.

For example, the ARFF data is named data-v1-download-1.arff.gz.

HabchiSarra · 2019-08-12T17:05:49Z

Thanks, @thomasjpfan.
I followed your recommendations and changed my commit accordingly.

amueller · 2019-08-12T19:11:51Z

Tests are failing.

HabchiSarra · 2019-08-12T20:35:25Z

@amueller I fixed the tests.
However, I don't think the failing doc build is related to my modifications.
Do you know its root cause or how to fix it?

amueller · 2019-08-12T20:38:39Z

Thanks! It's indeed unrelated to your changes. Please merge with the current master branch, where this issue has been fixed.

jnothman · 2019-08-13T01:48:54Z

sklearn/datasets/openml.py

-def _get_data_shape(data_qualities):
-    # Using the data_info dictionary from _get_data_info_by_name to extract
-    # the number of samples / features
+def _get_data_instances(data_qualities):


Suggested change

-def _get_data_instances(data_qualities):
+def _get_num_samples(data_qualities):

I think this would be clearer naming for readers of scikit-learn code.

I changed the name, thanks.

jnothman · 2019-08-13T01:50:27Z

sklearn/datasets/openml.py

+
+    Parameters
+    ----------
+    data_qualities : list


can't be a list. It's keyed by strings.

This is indeed a list of dict. Each dict has 2 keys "name" and "values".
This looks really weird because we could have a single dict with "name" being the key and "values" the associated value but this is not the case. This is actually what the dict comprehension is doing in l.448.

data_qualities : list of dict

Yes, it is a list of dict.
I updated the documentation accordingly.
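
The structure discussed in this thread can be sketched as follows. The sample entries are made up for illustration. The second key is written `value` here, which is an assumption; the comment above calls it "values".

```python
# Hypothetical sample shaped like the structure described above: a list of
# dicts, each describing one quality. Key names and numbers are assumptions.
data_qualities = [
    {"name": "NumberOfInstances", "value": "150.0"},
    {"name": "NumberOfFeatures", "value": "5.0"},
]

# Collapse the list into a single mapping keyed by the quality name.
qualities = {d["name"]: d["value"] for d in data_qualities}

# Values arrive as float-formatted strings, hence the int(float(...)) pattern
# that also appears in the diff under review.
n_samples = int(float(qualities["NumberOfInstances"]))
```

Once the list is flattened this way, each quality is a plain dictionary lookup.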

jnothman · 2019-08-13T01:51:56Z

sklearn/datasets/openml.py

-        return (int(float(qualities['NumberOfInstances'])),
-                int(float(qualities['NumberOfFeatures'])))
+        instances = int(float(qualities['NumberOfInstances']))
    except AttributeError:


I don't see how an AttributeError would be raised here.

It should be a KeyError, shouldn't it?

I changed the way we get the value, so it's not needed anymore.
See https://github.com/scikit-learn/scikit-learn/pull/14623/files#diff-ea672b15dd808c88257c58681d17bb6aR448
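
A minimal sketch of why the `except` clause becomes unnecessary, assuming the lookup uses `dict.get` with a default. The function name `get_num_samples_sketch` and the key names are hypothetical and have not been checked against the merged code; the `-1` default follows the docstring quoted later in this review.

```python
def get_num_samples_sketch(data_qualities, default=-1):
    """Return the sample count, or `default` when it cannot be determined.

    Hypothetical sketch; the key names ("name", "value") are assumptions.
    """
    if data_qualities is None:
        return default
    qualities = {d["name"]: d["value"] for d in data_qualities}
    # dict.get returns the default instead of raising KeyError.
    return int(float(qualities.get("NumberOfInstances", default)))


# A missing key on a plain dict raises KeyError, not AttributeError.
try:
    {}["NumberOfInstances"]
    raised = None
except KeyError:
    raised = "KeyError"
```

With a default in place, a missing quality no longer raises, so there is nothing left to catch.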

glemaitre

A couple of comments.

HabchiSarra · 2019-08-13T18:57:37Z

I addressed the reviews. Does the new version suit you?

thomasjpfan

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

glemaitre

Nitpicking; otherwise LGTM.

glemaitre · 2019-08-13T20:12:41Z

sklearn/datasets/openml.py

-    # Using the data_info dictionary from _get_data_info_by_name to extract
-    # the number of samples / features
+def _get_num_samples(data_qualities):
+    """Get the number of samples from data qualities


missing full stop.

glemaitre · 2019-08-13T20:12:52Z

sklearn/datasets/openml.py

+    -------
+    instances : int
+        The number of samples in the dataset or -1 if data qualities are
+        unavailable


missing full stop

glemaitre · 2019-08-13T20:13:49Z

sklearn/datasets/openml.py

+
+    Returns
+    -------
+    instances : int


Suggested change

-    instances : int
+    n_samples : int

Ok, it's done.

The shape extraction from data_qualities was using NumberOfFeatures, which excluded the ignored features. This exclusion caused a bug in the data conversion, since we tried to reshape the whole dataset with a lower number of features. This fix uses data_features to include ignored features in the shape extraction.

Fixes scikit-learn#14340

jnothman · 2019-08-13T23:51:14Z

Thanks @HabchiSarra!

thomasjpfan reviewed Aug 12, 2019

HabchiSarra force-pushed the fix-ignored-features-openml branch from a2505a6 to 53223a6, August 12, 2019 17:02

HabchiSarra force-pushed the fix-ignored-features-openml branch from 53223a6 to c6fdb40, August 12, 2019 20:01

HabchiSarra force-pushed the fix-ignored-features-openml branch from c6fdb40 to 60bc2b6, August 12, 2019 20:50

jnothman reviewed Aug 13, 2019

glemaitre requested changes Aug 13, 2019

HabchiSarra force-pushed the fix-ignored-features-openml branch from 60bc2b6 to e6e8a1d, August 13, 2019 18:34

thomasjpfan reviewed Aug 13, 2019

HabchiSarra force-pushed the fix-ignored-features-openml branch from e6e8a1d to 53921ca, August 13, 2019 19:53

glemaitre approved these changes Aug 13, 2019

HabchiSarra force-pushed the fix-ignored-features-openml branch from 53921ca to ea80272, August 13, 2019 20:48

jnothman approved these changes Aug 13, 2019

jnothman merged commit e49b9d3 into scikit-learn:master Aug 13, 2019
