
[MRG+1] ENH: dataset fetching using figshare and checksum #9240


Merged — 69 commits, Aug 3, 2017

Changes from all commits (69 commits):
773f0c5
add 20newsgroups dataset to figshare
nelson-liu Sep 14, 2016
a61c20f
made link less verbose
nelson-liu Sep 14, 2016
9e64651
add olivetti to figshare
nelson-liu Sep 14, 2016
b4866e6
add lfw to figshare
nelson-liu Sep 14, 2016
7068152
add california housing dataset to figshare
nelson-liu Sep 15, 2016
2082655
add covtype dataset to figshare
nelson-liu Sep 15, 2016
ff83bd1
add kddcup99 dataset to figshare
nelson-liu Sep 15, 2016
59eae87
add species distribution dataset to figshare
nelson-liu Sep 15, 2016
f33a52c
add rcv1 dataset
nelson-liu Sep 15, 2016
dfe24f9
remove extraneous parens from url strings
nelson-liu Oct 27, 2016
7186af8
check md5 of datasets and add resume functionality to downloads
nelson-liu Dec 24, 2016
4dc8946
remove extraneous print statements
nelson-liu Dec 24, 2016
7260f73
fix flake8 violations
nelson-liu Dec 24, 2016
f2c44ee
add docstrings to new dataset fetching functions
nelson-liu Dec 24, 2016
f6e6ce7
consolidate imports in base and use md5 check function in dl
nelson-liu Dec 25, 2016
983544e
remove accidentally removed import
nelson-liu Dec 25, 2016
03f7f82
attempt to fix docstring conventions / handle case where range header…
nelson-liu Dec 25, 2016
9d39dd0
change functions to used renamed, privatized utilities
nelson-liu Dec 25, 2016
5eadb3a
fix flake8 indentation error
nelson-liu Dec 25, 2016
79a0325
remove checks for joblib dumped files
nelson-liu Dec 27, 2016
29deaa5
fix error in lfw
nelson-liu Dec 27, 2016
269d028
Merge branch 'master' into use_figshare_in_datasets
nelson-liu Apr 27, 2017
773aa48
Add missing Bunch import in california housing
nelson-liu Apr 27, 2017
11c15db
Remove hash validation of 20news output pkl
nelson-liu Apr 28, 2017
f367815
Remove unused import
nelson-liu Apr 28, 2017
1637adb
Rebase 'master' into use_figshare_in_datasets
Jun 28, 2017
d11bc7a
address missing comments in #7429 to start the PR fresh
Jun 29, 2017
ef89676
update _fetch_and_verify_dataset function
Jun 29, 2017
7cf9422
update URL10
Jun 29, 2017
d604d49
Use strerr compatible with python2
Jul 4, 2017
7309779
Use warnings instead of StdErr (suggested by @lesteve)
Jul 4, 2017
0f7e66c
Fix pep8
Jul 4, 2017
0a9ca7d
Replace MD5 by SHA256
Jul 4, 2017
083acda
Fix cal_housing fetcher for the case of having the data locally
Jul 4, 2017
f48a919
Merge branch 'master' into use_figshare_in_datasets
Jul 10, 2017
38a4c02
Revert removing file when checksum fails
Jul 10, 2017
c9db0f3
Keep covertype's original URL as a comment
Jul 10, 2017
f991b2b
Rework the docstrings
Jul 10, 2017
fa1559f
Remove partial download
Jul 10, 2017
b8d8d5a
Add download compatibility with python 2.x
Jul 12, 2017
949d998
Add comment to clarify the usage passing a zipfile to np.load
Jul 13, 2017
7efa606
Fix typo
Jul 19, 2017
fead360
simplify some docstrings and functions
Jul 19, 2017
e7db2d8
Removed wired dictionaries to store remote metadata for lfw dataset
Jul 19, 2017
6601cbd
fixup! fix flake8 violations
Jul 19, 2017
2ffcfc1
Fix rcv1 and rename path to filename
lesteve Jul 19, 2017
02f5a7d
Cosmit
lesteve Jul 20, 2017
f54eabd
Add lfw missing checksum
Jul 20, 2017
3c210c2
Unify fetchers to use RemoteMetaData
Jul 20, 2017
a897f9f
revert logger info in favor of warning
Jul 21, 2017
88d7f61
Add original urls as comments and tides up PY3_OR_LATER
Jul 24, 2017
22130a9
use urlretrieve from six
Jul 24, 2017
d4f9456
remove fetch_url
Jul 24, 2017
38ba738
Rename _fetch_remote path parameter into dirname
lesteve Jul 25, 2017
5dfdafb
Use variable to remove repeated code
lesteve Jul 25, 2017
1286364
Return file_path from _fetch_remote
lesteve Jul 25, 2017
240bfe5
Remove blank lines after comments
lesteve Jul 25, 2017
60b1153
List all links
lesteve Jul 25, 2017
d1250a8
Fix lfw
lesteve Jul 25, 2017
580b131
Tweak comment
lesteve Jul 25, 2017
7295474
Use returned value for _fetch_remote
lesteve Jul 25, 2017
076efb1
Rename variable
lesteve Jul 25, 2017
7fc6627
Minor changes
lesteve Jul 25, 2017
de80947
checksum fix
lesteve Jul 25, 2017
ba862fb
Remove unused imports
lesteve Jul 25, 2017
7a5b9b6
Comment minor tweak
lesteve Jul 25, 2017
29a0301
Convert list of remotes into tuple of remotes to ensure immutability
Aug 1, 2017
bf869a6
Move from print statements to logging
Aug 1, 2017
6daa256
Configure root logger in sklearn/__init__.py
lesteve Aug 2, 2017
5 changes: 5 additions & 0 deletions sklearn/__init__.py
@@ -17,6 +17,11 @@
import warnings
import os
from contextlib import contextmanager as _contextmanager
import logging

logger = logging.getLogger(__name__)
Review comment (Member): why not "sklearn" instead of __name__?

Reply (@lesteve, Member, Aug 2, 2017): I guess this is just the general convention, right?

I found this in the Python docs:

    A good convention to use when naming loggers is to use a module-level logger, in each module which uses logging, named as follows:

    logger = logging.getLogger(__name__)

and this in the Hitchhiker's Guide to Python:

    Best practice when instantiating loggers in a library is to only create them using the __name__ global variable: the logging module creates a hierarchy of loggers using dot notation, so using __name__ ensures no name collisions.

logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)

_ASSUME_FINITE = bool(os.environ.get('SKLEARN_ASSUME_FINITE', False))

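As an aside on the review thread above — a minimal sketch (not part of the diff) of why getLogger(__name__) works here: in sklearn/__init__.py, __name__ is "sklearn", and the logging module arranges loggers in a dot-separated hierarchy, so records from any sklearn.* module logger propagate up to the handler configured above.

    import logging

    # Equivalent to what sklearn/__init__.py does, since __name__ == "sklearn" there.
    sklearn_logger = logging.getLogger("sklearn")
    sklearn_logger.addHandler(logging.StreamHandler())
    sklearn_logger.setLevel(logging.INFO)

    # A module-level logger, e.g. __name__ == "sklearn.datasets.california_housing",
    # propagates its records up the dotted hierarchy to "sklearn", whose
    # StreamHandler prints them.
    module_logger = logging.getLogger("sklearn.datasets.california_housing")
    module_logger.info("Downloading ...")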
169 changes: 107 additions & 62 deletions sklearn/datasets/base.py
@@ -6,39 +6,40 @@
# 2010 Fabian Pedregosa <[email protected]>
# 2010 Olivier Grisel <[email protected]>
# License: BSD 3 clause
from __future__ import print_function
Review comment (Member): Is there any print statement?

Reply (Contributor Author): It is used in doctests, lines 509 and 674.


import os
import csv
import sys
import shutil
from os import environ
from os.path import dirname
from os.path import join
from os.path import exists
from os.path import expanduser
from os.path import isdir
from os.path import splitext
from os import listdir
from os import makedirs
from collections import namedtuple
from os import environ, listdir, makedirs
from os.path import dirname, exists, expanduser, isdir, join, splitext
import hashlib

from ..utils import Bunch
from ..utils import check_random_state

import numpy as np

from ..utils import check_random_state
from sklearn.externals.six.moves.urllib.request import urlretrieve

RemoteFileMetadata = namedtuple('RemoteFileMetadata',
['filename', 'url', 'checksum'])


def get_data_home(data_home=None):
"""Return the path of the scikit-learn data dir.

This folder is used by some large dataset loaders to avoid
downloading the data several times.
This folder is used by some large dataset loaders to avoid downloading the
data several times.

By default the data dir is set to a folder named 'scikit_learn_data'
in the user home folder.
By default the data dir is set to a folder named 'scikit_learn_data' in the
user home folder.

Alternatively, it can be set by the 'SCIKIT_LEARN_DATA' environment
variable or programmatically by giving an explicit folder path. The
'~' symbol is expanded to the user home folder.
variable or programmatically by giving an explicit folder path. The '~'
symbol is expanded to the user home folder.

If the folder does not already exist, it is automatically created.
"""
@@ -76,23 +77,22 @@ def load_files(container_path, description=None, categories=None,
file_44.txt
...

The folder names are used as supervised signal label names. The
individual file names are not important.
The folder names are used as supervised signal label names. The individual
file names are not important.

This function does not try to extract features into a numpy array or
scipy sparse matrix. In addition, if load_content is false it
does not try to load the files in memory.
This function does not try to extract features into a numpy array or scipy
sparse matrix. In addition, if load_content is false it does not try to
load the files in memory.

To use text files in a scikit-learn classification or clustering
algorithm, you will need to use the `sklearn.feature_extraction.text`
module to build a feature extraction transformer that suits your
problem.
To use text files in a scikit-learn classification or clustering algorithm,
you will need to use the `sklearn.feature_extraction.text` module to build
a feature extraction transformer that suits your problem.

If you set load_content=True, you should also specify the encoding of
the text using the 'encoding' parameter. For many modern text files,
'utf-8' will be the correct encoding. If you leave encoding equal to None,
then the content will be made of bytes instead of Unicode, and you will
not be able to use most functions in `sklearn.feature_extraction.text`.
If you set load_content=True, you should also specify the encoding of the
text using the 'encoding' parameter. For many modern text files, 'utf-8'
will be the correct encoding. If you leave encoding equal to None, then the
content will be made of bytes instead of Unicode, and you will not be able
to use most functions in `sklearn.feature_extraction.text`.

Similar feature extractors should be built for other kind of unstructured
data input such as images, audio, video, ...
@@ -109,20 +109,19 @@ def load_files(container_path, description=None, categories=None,
reference, etc.

categories : A collection of strings or None, optional (default=None)
If None (default), load all the categories.
If not None, list of category names to load (other categories ignored).
If None (default), load all the categories. If not None, list of
category names to load (other categories ignored).

load_content : boolean, optional (default=True)
Whether to load or not the content of the different files. If
true a 'data' attribute containing the text information is present
in the data structure returned. If not, a filenames attribute
gives the path to the files.
Whether to load or not the content of the different files. If true a
'data' attribute containing the text information is present in the data
structure returned. If not, a filenames attribute gives the path to the
files.

encoding : string or None (default is None)
If None, do not try to decode the content of the files (e.g. for
images or other non-text content).
If not None, encoding to use to decode text files to Unicode if
load_content is True.
If None, do not try to decode the content of the files (e.g. for images
or other non-text content). If not None, encoding to use to decode text
files to Unicode if load_content is True.

decode_error : {'strict', 'ignore', 'replace'}, optional
Instruction on what to do if a byte sequence is given to analyze that
@@ -262,16 +261,15 @@ def load_wine(return_X_y=False):
Returns
-------
data : Bunch
Dictionary-like object, the interesting attributes are:
'data', the data to learn, 'target', the classification labels,
'target_names', the meaning of the labels, 'feature_names', the
meaning of the features, and 'DESCR', the
full description of the dataset.
Dictionary-like object, the interesting attributes are: 'data', the
data to learn, 'target', the classification labels, 'target_names', the
meaning of the labels, 'feature_names', the meaning of the features,
and 'DESCR', the full description of the dataset.

(data, target) : tuple if ``return_X_y`` is True

The copy of UCI ML Wine Data Set dataset is
downloaded and modified to fit standard format from:
The copy of UCI ML Wine Data Set dataset is downloaded and modified to fit
standard format from:
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

Examples
@@ -332,8 +330,8 @@ def load_iris(return_X_y=False):
Parameters
----------
return_X_y : boolean, default=False.
If True, returns ``(data, target)`` instead of a Bunch object.
See below for more information about the `data` and `target` object.
If True, returns ``(data, target)`` instead of a Bunch object. See
below for more information about the `data` and `target` object.

.. versionadded:: 0.18

@@ -709,15 +707,15 @@ def load_boston(return_X_y=False):

def load_sample_images():
"""Load sample images for image manipulation.

Loads both, ``china`` and ``flower``.

Returns
-------
data : Bunch
Dictionary-like object with the following attributes :
'images', the two sample images, 'filenames', the file
names for the images, and 'DESCR'
the full description of the dataset.
Dictionary-like object with the following attributes : 'images', the
two sample images, 'filenames', the file names for the images, and
'DESCR' the full description of the dataset.

Examples
--------
@@ -799,18 +797,18 @@ def load_sample_image(image_name):
def _pkl_filepath(*args, **kwargs):
"""Ensure different filenames for Python 2 and Python 3 pickles

An object pickled under Python 3 cannot be loaded under Python 2.
An object pickled under Python 2 can sometimes not be loaded
correctly under Python 3 because some Python 2 strings are decoded as
Python 3 strings which can be problematic for objects that use Python 2
strings as byte buffers for numerical data instead of "real" strings.
An object pickled under Python 3 cannot be loaded under Python 2. An object
pickled under Python 2 can sometimes not be loaded correctly under Python 3
because some Python 2 strings are decoded as Python 3 strings which can be
problematic for objects that use Python 2 strings as byte buffers for
numerical data instead of "real" strings.

Therefore, dataset loaders in scikit-learn use different files for pickles
managed by Python 2 and Python 3 in the same SCIKIT_LEARN_DATA folder so
as to avoid conflicts.
managed by Python 2 and Python 3 in the same SCIKIT_LEARN_DATA folder so as
to avoid conflicts.

args[-1] is expected to be the ".pkl" filename. Under Python 3, a
suffix is inserted before the extension to separate Python 2 and Python 3
pickles.
args[-1] is expected to be the ".pkl" filename. Under Python 3, a suffix is
inserted before the extension to separate Python 2 and Python 3 pickles.

_pkl_filepath('/path/to/folder', 'filename.pkl') returns:
- /path/to/folder/filename.pkl under Python 2
@@ -823,3 +821,50 @@ def _pkl_filepath(*args, **kwargs):
basename += py3_suffix
new_args = args[:-1] + (basename + ext,)
return join(*new_args)


def _sha256(path):
"""Calculate the sha256 hash of the file at path."""
sha256hash = hashlib.sha256()
chunk_size = 8192
with open(path, "rb") as f:
while True:
buffer = f.read(chunk_size)
if not buffer:
break
sha256hash.update(buffer)
return sha256hash.hexdigest()


def _fetch_remote(remote, dirname=None):
"""Helper function to download a remote dataset into path

Fetch a dataset pointed by remote's url, save into path using remote's
filename and ensure its integrity based on the SHA256 Checksum of the
downloaded file.

Parameters
-----------
remote : RemoteFileMetadata
Named tuple containing remote dataset meta information: url, filename
and checksum

dirname : string
Directory to save the file to.

Returns
-------
file_path : string
Full path of the created file.
"""

file_path = (remote.filename if dirname is None
else join(dirname, remote.filename))
urlretrieve(remote.url, file_path)
checksum = _sha256(file_path)
if remote.checksum != checksum:
raise IOError("{} has an SHA256 checksum ({}) "
"differing from expected ({}), "
"file may be corrupted.".format(file_path, checksum,
remote.checksum))
return file_path
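Putting the two helpers together — a sketch (not code from the PR) of how a dataset fetcher is expected to call the private _fetch_remote with a RemoteFileMetadata entry; the figshare URL and checksum are the California-housing values declared in the diff below:

    from sklearn.datasets.base import (RemoteFileMetadata, _fetch_remote,
                                       get_data_home)

    archive = RemoteFileMetadata(
        filename='cal_housing.tgz',
        url='https://ndownloader.figshare.com/files/5976036',
        checksum=('aaa5c9a6afe2225cc2aed2723682ae40'
                  '3280c4a3695a2ddda4ffb5d8215ea681'))

    # Downloads into the scikit-learn data home; _fetch_remote raises IOError
    # if the SHA256 of the downloaded file differs from the expected checksum.
    archive_path = _fetch_remote(archive, dirname=get_data_home())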
37 changes: 20 additions & 17 deletions sklearn/datasets/california_housing.py
@@ -21,33 +21,33 @@
# Authors: Peter Prettenhofer
# License: BSD 3 clause

from io import BytesIO
from os.path import exists
from os import makedirs
from os import makedirs, remove
import tarfile

try:
# Python 2
from urllib2 import urlopen
except ImportError:
# Python 3+
from urllib.request import urlopen

import numpy as np
import logging

from .base import get_data_home
from ..utils import Bunch
from .base import _fetch_remote
from .base import _pkl_filepath
from .base import RemoteFileMetadata
from ..utils import Bunch
from ..externals import joblib


DATA_URL = "http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz"
TARGET_FILENAME = "cal_housing.pkz"
# The original data can be found at:
# http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz
ARCHIVE = RemoteFileMetadata(
filename='cal_housing.tgz',
url='https://ndownloader.figshare.com/files/5976036',
checksum=('aaa5c9a6afe2225cc2aed2723682ae40'
'3280c4a3695a2ddda4ffb5d8215ea681'))

# Grab the module-level docstring to use as a description of the
# dataset
MODULE_DOCS = __doc__

logger = logging.getLogger(__name__)

def fetch_california_housing(data_home=None, download_if_missing=True):
"""Loader for the California housing dataset from StatLib.
@@ -89,17 +89,20 @@ def fetch_california_housing(data_home=None, download_if_missing=True):
if not exists(data_home):
makedirs(data_home)

filepath = _pkl_filepath(data_home, TARGET_FILENAME)
filepath = _pkl_filepath(data_home, 'cal_housing.pkz')
if not exists(filepath):
if not download_if_missing:
raise IOError("Data not found and `download_if_missing` is False")

print('downloading Cal. housing from %s to %s' % (DATA_URL, data_home))
archive_fileobj = BytesIO(urlopen(DATA_URL).read())
logger.info('Downloading Cal. housing from {} to {}'.format(
ARCHIVE.url, data_home))
archive_path = _fetch_remote(ARCHIVE, dirname=data_home)

fileobj = tarfile.open(
mode="r:gz",
fileobj=archive_fileobj).extractfile(
name=archive_path).extractfile(
'CaliforniaHousing/cal_housing.data')
remove(archive_path)

cal_housing = np.loadtxt(fileobj, delimiter=',')
# Columns are not in the same order compared to the previous
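End to end, the user-facing behavior after this PR — a sketch assuming the merged API:

    from sklearn.datasets import fetch_california_housing

    # The first call downloads cal_housing.tgz from figshare, verifies its
    # SHA256 checksum, caches a pickle in the data home, and removes the
    # archive; subsequent calls load the cached pickle directly.
    housing = fetch_california_housing()
    print(housing.data.shape)     # (20640, 8)
    print(housing.feature_names)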