fetch_openml migration between 0.20.0 and 0.20.1 #12517

@rth

There have been quite a few changes in fetch_openml since the 0.20.0 release. It would be helpful to check that our code to read cached responses from OpenML is backward compatible between 0.20.0 and master.

For instance, when loading MNIST for #12504 and switching between 0.20 and master, I got:

In [1]: from sklearn.datasets import fetch_openml
In [2]: fetch_openml('mnist_784', version=1, return_X_y=True)                                                                                                
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-2-0c2a4453f0a2> in <module>
----> 1 fetch_openml('mnist_784', version=1, return_X_y=True)

~/src/scikit-learn/sklearn/datasets/openml.py in fetch_openml(name, version, data_id, data_home, target_column, cache, return_X_y)
    490                 "specify a numeric data_id or a name, not "
    491                 "both.".format(data_id, name))
--> 492         data_info = _get_data_info_by_name(name, version, data_home)
    493         data_id = data_info['did']
    494     elif data_id is not None:

~/src/scikit-learn/sklearn/datasets/openml.py in _get_data_info_by_name(name, version, data_home)
    285     url = (_SEARCH_NAME + "/data_version/{}").format(name, version)
    286     json_data = _get_json_content_from_openml_api(url, None, False,
--> 287                                                   data_home)
    288     if json_data is None:
    289         # we can do this in 1 function call if OpenML does not require the

~/src/scikit-learn/sklearn/datasets/openml.py in _get_json_content_from_openml_api(url, error_message, raise_if_error, data_home)
    143         else:
    144             return None
--> 145     json_data = json.loads(response.read().decode("utf-8"))
    146     response.close()
    147     return json_data

~/.miniconda3/envs/sklearn-dev/lib/python3.7/gzip.py in read(self, size)
    274             import errno
    275             raise OSError(errno.EBADF, "read() on write-only GzipFile object")
--> 276         return self._buffer.read(size)
    277 
    278     def read1(self, size=-1):

~/.miniconda3/envs/sklearn-dev/lib/python3.7/gzip.py in read(self, size)
    461                 # jump to the next member, if there is one.
    462                 self._init_read()
--> 463                 if not self._read_gzip_header():
    464                     self._size = self._pos
    465                     return b""

~/.miniconda3/envs/sklearn-dev/lib/python3.7/gzip.py in _read_gzip_header(self)
    409 
    410         if magic != b'\037\213':
--> 411             raise OSError('Not a gzipped file (%r)' % magic)
    412 
    413         (method, flag,

OSError: Not a gzipped file (b'{"')
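
The magic bytes `b'{"'` in the final error look like the start of a plain JSON document, which suggests the cached file was written uncompressed and is now being opened through `gzip`. That reading is an inference from the traceback, not something confirmed in the code. A minimal sketch that reproduces the same error outside scikit-learn, using only the standard library (the file name and JSON payload are made up for illustration):

```python
import gzip
import os
import tempfile

# Hypothetical stand-in for a cached response that was stored as plain JSON.
cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "response.gz")
with open(cache_path, "wb") as f:
    f.write(b'{"data": {"dataset": []}}')

# Reading it back through gzip fails on the header check, because the
# first two bytes are not the gzip magic number.
try:
    with gzip.open(cache_path, "rb") as f:
        f.read()
except OSError as exc:
    error_message = str(exc)
    print(error_message)
```

On recent Python versions the exception is `gzip.BadGzipFile`, which is a subclass of `OSError`, so the `except` clause still catches it.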

I'm not sure whether this is because the load had issues in 0.20.0. In any case, I think that when a cached response cannot be loaded or parsed, the general behaviour should be to raise a warning and re-download it, instead of failing.

In the above case, manually removing ~/scikit_learn_data fixed it, but users shouldn't have to do that.
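
The warn-and-re-download behaviour suggested above could look roughly like the sketch below. It is not the scikit-learn implementation: `load_json_with_fallback` and its `download` argument are hypothetical names, and the cache is assumed to be a single gzipped JSON file.

```python
import gzip
import json
import os
import warnings
import zlib


def load_json_with_fallback(cache_path, download):
    """Return parsed JSON from cache_path, re-downloading if it is unreadable.

    `download` is a hypothetical callable that returns the raw JSON bytes;
    it stands in for the actual request to OpenML.
    """
    if os.path.exists(cache_path):
        try:
            with gzip.open(cache_path, "rb") as f:
                return json.loads(f.read().decode("utf-8"))
        except (OSError, EOFError, zlib.error, ValueError) as exc:
            # OSError: not a gzip file; EOFError / zlib.error: truncated or
            # corrupted stream; ValueError: invalid JSON or invalid UTF-8.
            warnings.warn(
                "Invalid cache file {}, re-downloading ({})".format(
                    cache_path, exc),
                RuntimeWarning)
            os.remove(cache_path)

    raw = download()
    # Parse before writing so that an invalid response is never cached.
    data = json.loads(raw.decode("utf-8"))
    with gzip.open(cache_path, "wb") as f:
        f.write(raw)
    return data
```

With something like this, a stale cache left behind by an older version produces a warning and a fresh download, and the user does not need to delete the data directory by hand.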

cc @janvanrijn
