Contributor

@jackzyliu jackzyliu commented Jun 17, 2021

Reference Issues/PRs

Closes #20081
helping #20150

What does this implement/fix? Explain your changes.

  • Replaces occurrences of __file__ with importlib.resources in sklearn/datasets/*
  • Compatibility with pyOxidizer and zipapp:
    • change IO functions to use importlib.resources.{open, read}_{binary, text}
    • a few path-based IO tests use importlib.resources.path to avoid making the assumption that resources already live in a filesystem
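To illustrate why this matters for zipapp-style deployments, here is a minimal sketch that is not taken from the PR. It builds a throwaway zip archive containing a hypothetical package `zdemo_pkg` with one data file, then compares a `__file__`-based lookup with `importlib.resources`. The package and file names are invented for the example.

```python
import importlib
import os
import sys
import tempfile
import zipfile
from importlib import resources

# Build a zip archive holding a tiny package plus one data file.
# "zdemo_pkg" and "greeting.txt" are hypothetical names for this sketch.
archive = os.path.join(tempfile.mkdtemp(), "bundle.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("zdemo_pkg/__init__.py", "")
    zf.writestr("zdemo_pkg/greeting.txt", "hello from inside a zip\n")

# Make the archive importable, as zipapp-style packaging would.
sys.path.insert(0, archive)
importlib.invalidate_caches()
import zdemo_pkg

# The __file__-based path points inside the archive, so it is not a real file.
naive_path = os.path.join(os.path.dirname(zdemo_pkg.__file__), "greeting.txt")
path_exists = os.path.exists(naive_path)

# importlib.resources asks the package's loader for the data instead.
text = resources.read_text("zdemo_pkg", "greeting.txt")
```

On Python 3.11 and 3.12, `read_text` emits a `DeprecationWarning`, but the call itself still works.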

Any other comments?

  • What is the best way to test compatibility? Something like this?

sklearn/datasets:

- [x] sklearn/datasets/_base.py:357:    module_path = dirname(__file__)
- [x] sklearn/datasets/_base.py:484:    module_path = dirname(__file__)
- [x] sklearn/datasets/_base.py:601:    module_path = dirname(__file__)
- [x] sklearn/datasets/_base.py:750:    module_path = dirname(__file__)
- [x] sklearn/datasets/_base.py:857:    module_path = dirname(__file__)
- [x] sklearn/datasets/_base.py:956:    base_dir = join(dirname(__file__), "data/")
- [x] sklearn/datasets/_base.py:970:    with open(dirname(__file__) + "/descr/linnerud.rst") as f:
- [x] sklearn/datasets/_base.py:1052:    module_path = dirname(__file__)
- [x] sklearn/datasets/_base.py:1122:    module_path = join(dirname(__file__), "images")
- [x] sklearn/datasets/_kddcup99.py:205:    module_path = dirname(__file__)
- [x] sklearn/datasets/_lfw.py:338:    module_path = dirname(__file__)
- [x] sklearn/datasets/_lfw.py:528:    module_path = dirname(__file__)
- [x] sklearn/datasets/_rcv1.py:281:    module_path = dirname(__file__)
- [x] sklearn/datasets/_california_housing.py:176:    module_path = dirname(__file__)
- [x] sklearn/datasets/_olivetti_faces.py:140:    module_path = dirname(__file__)
- [x] sklearn/datasets/tests/test_svmlight_format.py:19:currdir = os.path.dirname(os.path.abspath(__file__))
- [x] sklearn/datasets/tests/test_openml.py:36:currdir = os.path.dirname(os.path.abspath(__file__))

Other (should probably be separate PR(s) since __file__ is used slightly differently in these occurrences):

- sklearn/__check_build/__init__.py:19:    local_dir = os.path.split(__file__)[0]
- sklearn/utils/__init__.py:1164:    root = str(Path(__file__).parent.parent)  # sklearn package
- sklearn/utils/_testing.py:733:        cwd = op.normpath(op.join(op.dirname(sklearn.__file__), ".."))


DATA_MODULE = "sklearn.datasets.data"
DESCR_MODULE = "sklearn.datasets.descr"
IMAGES_MODULE = "sklearn.datasets.images"
Contributor Author

Both string and module work. Using string here so that it can be more conveniently serialized (e.g. pickle) as part of a Bunch.
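A small sketch of the serialization point, using only the standard library. The dict stands in for a `Bunch`, its keys are illustrative, and `json` is used merely as an arbitrary module object.

```python
import json  # any importable module works as a stand-in here
import pickle

# A plain dict standing in for a Bunch; the keys are illustrative only.
bunch_like = {"filename": "iris.csv", "file_module": "sklearn.datasets.data"}

# A module name stored as a string survives a pickle round trip.
restored = pickle.loads(pickle.dumps(bunch_like))

# A module object, by contrast, cannot be pickled.
try:
    pickle.dumps({"file_module": json})
    module_pickled = True
except TypeError:
    module_pickled = False
```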

feature_names=feature_names,
filename=csv_filename,
filename=data_file_name,
file_module=DATA_MODULE
Contributor Author

To be compatible with pyOxidizer and zipapp, we can no longer pass around an absolute file path. Instead, we separately pass around filename (just the name of the file) and file_module that the file lives in.

assert all([os.path.exists(bunch.get(f, False)) for f in filenames])
assert all([
f in bunch and 'file_module' in bunch and
importlib_resources.is_resource(bunch['file_module'], bunch[f])
Contributor Author

Modifying the test here to remove checks on the absolute paths.
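The modified check can be sketched as follows, assuming a throwaway package created on disk. `demo_bunch_data`, the helper `resources_present`, and the dict standing in for a `Bunch` are all hypothetical.

```python
import importlib
import sys
import tempfile
from importlib import resources
from pathlib import Path

# Create a throwaway package on disk; "demo_bunch_data" is a hypothetical name.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_bunch_data"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "iris.csv").write_text("5.1,3.5,1.4,0.2,0\n")
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# A dict standing in for the Bunch returned by a loader.
bunch = {"filename": "iris.csv", "file_module": "demo_bunch_data"}


def resources_present(bunch, keys):
    """Return True if every key names a resource inside bunch['file_module']."""
    return all(
        key in bunch
        and "file_module" in bunch
        and resources.is_resource(bunch["file_module"], bunch[key])
        for key in keys
    )


ok = resources_present(bunch, ["filename"])
missing = resources_present({**bunch, "filename": "nope.csv"}, ["filename"])
```

The check never touches an absolute path, so it does not assume the resource lives on a filesystem.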

descr = load_descr(IMAGES_MODULE, 'README.txt')

images_module_traversable = importlib_resources.files(IMAGES_MODULE)
with importlib_resources.as_file(images_module_traversable) as images_dir:
Contributor Author

I think this is only available in Python >= 3.9, so it could be problematic. We probably need to find another (hopefully clean) way to do this.

Contributor Author

We can use the backport library importlib_resources for python<=3.8 as noted here.
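One common way to express that fallback is a guarded import. This is a sketch, not the code from the PR: `demo_images` and `README.txt` are hypothetical, and the `except` branch assumes the third-party `importlib_resources` package is installed on older interpreters.

```python
import importlib
import sys
import tempfile
from pathlib import Path

try:
    # Standard library, Python 3.9 and later.
    from importlib.resources import as_file, files
except ImportError:
    # Third-party backport for older interpreters.
    from importlib_resources import as_file, files

# Throwaway package; "demo_images" and "README.txt" are hypothetical names.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_images"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "README.txt").write_text("two sample images\n")
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# files() returns a Traversable; as_file() yields a real filesystem path,
# extracting to a temporary file first if the resource is not already on disk.
readme = files("demo_images").joinpath("README.txt")
with as_file(readme) as readme_path:
    content = Path(readme_path).read_text()
```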

@quendee quendee Jun 18, 2021

@jackzyliu Do you know if this is the reason for the failures in the PR checks? If so shall we try to make sure everything else is working and tackle the images after?

Contributor Author

Yea, it is.

I will make the images portion backward-compatible.

Member

@ogrisel ogrisel left a comment

The point I expressed in #20150 (comment) is to avoid using the open() Python builtin anywhere and instead use importlib.resources.open_binary/open_text or importlib.resources.read_binary/read_text, documented here:

https://docs.python.org/3/library/importlib.html#module-importlib.resources

The goal is to avoid assuming, at any point, that those resource names are actual filenames of resources stored on a filesystem.

@jackzyliu
Contributor Author

The goal is to avoid assuming, at any point, that those resource names are actual filenames of resources stored on a filesystem.

Gotcha. I will make these changes accordingly and make the .files portion backward-compatible.

target_names[0] is the name of the target[0] class.
"""
with open(join(module_path, "data", data_file_name)) as csv_file:
with importlib_resources.open_text(data_module, data_file_name) as csv_file:
Contributor Author

.open_text has encoding="utf-8" by default (reference)

Happy to make it explicit though.

Member

I think it is fine to rely on the default utf-8 encoding assumption. I am pretty sure that all our CSV files only need the ASCII encoding, but this is a valid subset of utf-8 as far as I know.

We just needed to ensure that the loader would not implicitly try to use a platform dependent encoding.
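A quick sketch of the encoding behaviour discussed here, assuming a throwaway package `demo_utf8_data` (a hypothetical name) whose CSV contains non-ASCII characters. The resource is decoded as UTF-8 without passing `encoding` explicitly.

```python
import importlib
import sys
import tempfile
from importlib import resources
from pathlib import Path

# Throwaway package; "demo_utf8_data" and "names.csv" are hypothetical names.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_utf8_data"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
# Write UTF-8 bytes directly so the on-disk encoding is unambiguous.
(pkg / "names.csv").write_bytes("café,naïve\n".encode("utf-8"))
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# open_text defaults to encoding="utf-8", so the result does not depend on
# the platform's locale settings.
with resources.open_text("demo_utf8_data", "names.csv") as csv_file:
    first_line = csv_file.readline()
```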

return data, target, target_names


def load_gzip_compressed_csv_data(data_module, data_file_name, **kwargs):
Contributor Author

I don't think we can ask np.loadtxt to decompress if we are passing in a file object (as opposed to a file name ending with '.gz'). So for the few .csv.gz files, we first manually decompress the file object and then use np.loadtxt to load the data as before.
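The decompress-then-parse step might look like the sketch below. To stay standard-library only it parses with `csv.reader` rather than `np.loadtxt`, which is a substitution made for this example. `demo_gz_data`, `tiny.csv.gz`, and `load_gzip_csv_rows` are hypothetical names.

```python
import csv
import gzip
import importlib
import sys
import tempfile
from importlib import resources
from pathlib import Path

# Throwaway package containing one gzip-compressed CSV file.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_gz_data"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "tiny.csv.gz").write_bytes(gzip.compress(b"1.0,2.0\n3.0,4.0\n"))
sys.path.insert(0, str(root))
importlib.invalidate_caches()


def load_gzip_csv_rows(data_module, data_file_name, encoding="utf-8"):
    """Decompress a gzipped CSV resource and parse it into rows of floats."""
    with resources.open_binary(data_module, data_file_name) as compressed:
        # gzip.open accepts an already-open binary file object.
        with gzip.open(compressed, mode="rt", encoding=encoding) as text_file:
            return [[float(v) for v in row] for row in csv.reader(text_file)]


rows = load_gzip_csv_rows("demo_gz_data", "tiny.csv.gz")
```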

("breast_cancer.csv", 569, 30, ["malignant", "benign"]),
],
)
def test_load_csv_data(
Contributor Author

@jackzyliu jackzyliu Jun 23, 2021

Adding standalone tests for load_csv_data and load_gzip_compressed_csv_data.


filenames = list()
images = list()
for filename in sorted(importlib_resources.contents(IMAGES_MODULE)):
Contributor Author

@jackzyliu jackzyliu Jun 23, 2021

We can directly use .contents() (which returns a list of file names and is available since python 3.7) to avoid using .files(). It was my mistake not remembering this option.
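A sketch of listing image resources by name. Because `contents()` has been deprecated since Python 3.11, this version prefers `files()` when it exists and falls back to `contents()` otherwise, which differs from the PR code. `demo_img_pkg` and `list_images` are hypothetical names, and the `.jpg` files are empty placeholders.

```python
import importlib
import sys
import tempfile
from importlib import resources
from pathlib import Path

# Throwaway package with two placeholder images and a README.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_img_pkg"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
for name in ("flower.jpg", "china.jpg", "README.txt"):
    (pkg / name).write_bytes(b"")
sys.path.insert(0, str(root))
importlib.invalidate_caches()


def list_images(package, suffix=".jpg"):
    """Return the sorted names of resources in `package` ending with `suffix`."""
    if hasattr(resources, "files"):
        # Python 3.9+: iterate over the package's Traversable entries.
        names = (entry.name for entry in resources.files(package).iterdir())
    else:
        # Python 3.7 and 3.8: contents() yields the entry names directly.
        names = resources.contents(package)
    return sorted(name for name in names if name.endswith(suffix))


image_names = list_images("demo_img_pkg")
```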

@jackzyliu
Contributor Author

@ogrisel

Just updated according to your comments. Is it more along the lines of what you were thinking?

I think the failing tests come from main. Will rebase after those are fixed.

@jackzyliu
Contributor Author

jackzyliu commented Jun 23, 2021

@ogrisel @quendee

At this point, all the conversions are done for sklearn/datasets. I think we should probably have separate PR(s) for the other 3 occurrences (see below: e.g. sklearn/__check_build/__init__.py:19) since __file__ is used slightly differently in these cases.

For ease of review, I have one separate, mostly self-contained commit for each of the following components:

  1. sklearn/datasets/_base.py
  2. test adjustment for test_common.py
  3. sklearn/datasets/{_california_housing, _covtype, _kddcup99, ...}.py: just the same changes applied to _base.py
  4. sklearn/datasets/tests/test_openml.py: This involves changing data directory names from digits to id_* to make them valid module names (thus the large number of file changes), but otherwise straightforward.
  5. sklearn/datasets/tests/test_svmlight_format.py: A few tests explicitly test path-based input for load_svmlight_file and load_svmlight_files. For those tests, I used importlib.resources.path (which provides a temporary path within a context manager if necessary), and otherwise used importlib.resources.open_binary as much as possible. Hope this is ok.

Let me know if you have any comments.


sklearn/datasets:

- [x] sklearn/datasets/_base.py:357:    module_path = dirname(__file__)
- [x] sklearn/datasets/_base.py:484:    module_path = dirname(__file__)
- [x] sklearn/datasets/_base.py:601:    module_path = dirname(__file__)
- [x] sklearn/datasets/_base.py:750:    module_path = dirname(__file__)
- [x] sklearn/datasets/_base.py:857:    module_path = dirname(__file__)
- [x] sklearn/datasets/_base.py:956:    base_dir = join(dirname(__file__), "data/")
- [x] sklearn/datasets/_base.py:970:    with open(dirname(__file__) + "/descr/linnerud.rst") as f:
- [x] sklearn/datasets/_base.py:1052:    module_path = dirname(__file__)
- [x] sklearn/datasets/_base.py:1122:    module_path = join(dirname(__file__), "images")
- [x] sklearn/datasets/_kddcup99.py:205:    module_path = dirname(__file__)
- [x] sklearn/datasets/_lfw.py:338:    module_path = dirname(__file__)
- [x] sklearn/datasets/_lfw.py:528:    module_path = dirname(__file__)
- [x] sklearn/datasets/_rcv1.py:281:    module_path = dirname(__file__)
- [x] sklearn/datasets/_california_housing.py:176:    module_path = dirname(__file__)
- [x] sklearn/datasets/_olivetti_faces.py:140:    module_path = dirname(__file__)
- [x] sklearn/datasets/tests/test_svmlight_format.py:19:currdir = os.path.dirname(os.path.abspath(__file__))
- [x] sklearn/datasets/tests/test_openml.py:36:currdir = os.path.dirname(os.path.abspath(__file__))

Others (should probably be separate PR(s) since __file__ is used slightly differently in these occurrences):

- sklearn/__check_build/__init__.py:19:    local_dir = os.path.split(__file__)[0]
- sklearn/utils/__init__.py:1164:    root = str(Path(__file__).parent.parent)  # sklearn package
- sklearn/utils/_testing.py:733:        cwd = op.normpath(op.join(op.dirname(sklearn.__file__), ".."))

@jackzyliu
Contributor Author

Updated what's new, and added a few .DESCR checks for the few modified data fetch functions in sklearn/datasets (e.g. fetch_olivetti_faces).

Member

@ogrisel ogrisel left a comment

Thanks @jackzyliu. This looks good overall. Here are a few suggestions for further improvement. Once those are done, I think this will be ready for merge on my end.

Comment on lines 344 to 399
fdescr = importlib_resources.read_text(descr_module, descr_file_name)

return fdescr
Member

As far as I can see most of the descriptions are loaded from DESCR_MODULE so to avoid redundancy we could have:

load_descr(descr_file_name, descr_module=DESCR_MODULE):

that would make it possible to only import load_descr without having to import the DESCR_MODULE constant most of the time.

Member

For convenience you could add an optional descr_file_name=None / descr_module=DESCR_MODULE to load_csv_data and load_gzip_compressed_csv_data and, when descr_file_name is not None, make those functions return the loaded description as an additional return value.

Contributor Author

As far as I can see most of the descriptions are loaded from DESCR_MODULE so to avoid redundancy we could have:

load_descr(descr_file_name, descr_module=DESCR_MODULE):

that would make it possible to only import load_descr without having to import the DESCR_MODULE constant most of the time.

By extension, I am applying the same changes to load_csv_data and load_gzip_compressed_csv_data to avoid the same redundancy since these data are always loaded from DATA_MODULE.

load_csv_data(data_file_name, data_module=DATA_MODULE)

And since the importlib.resources functions take the resource module before the resource name as positional arguments, I am making the module arguments keyword-only to avoid usage errors.

load_csv_data(data_file_name, *, data_module=DATA_MODULE)
load_descr(descr_file_name, *, descr_module=DESCR_MODULE)

Lastly, combined with the suggestion to add optional descr_file_name=None, descr_module=DESCR_MODULE, we have the following function signature:

load_csv_data(data_file_name, *, data_module=DATA_MODULE, descr_file_name=None, descr_module=DESCR_MODULE)
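The effect of the keyword-only signature can be sketched as below. This is a simplified stand-in, not the scikit-learn implementation: it returns raw CSV rows instead of `data, target, target_names`, and `demo_kw_data` is a hypothetical package replacing the real `DATA_MODULE` and `DESCR_MODULE` values.

```python
import csv
import importlib
import sys
import tempfile
from importlib import resources
from pathlib import Path

# Throwaway package standing in for sklearn.datasets.data / .descr.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_kw_data"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "iris.csv").write_text("5.1,3.5,1.4,0.2,0\n")
(pkg / "iris.rst").write_text("Iris plants dataset\n")
sys.path.insert(0, str(root))
importlib.invalidate_caches()

DATA_MODULE = "demo_kw_data"   # hypothetical stand-in
DESCR_MODULE = "demo_kw_data"  # hypothetical stand-in


def load_csv_data(data_file_name, *, data_module=DATA_MODULE,
                  descr_file_name=None, descr_module=DESCR_MODULE):
    """Load CSV rows, plus the description when descr_file_name is given."""
    with resources.open_text(data_module, data_file_name) as csv_file:
        rows = list(csv.reader(csv_file))
    if descr_file_name is None:
        return rows
    descr = resources.read_text(descr_module, descr_file_name)
    return rows, descr


rows = load_csv_data("iris.csv")
rows_again, descr = load_csv_data("iris.csv", descr_file_name="iris.rst")

# Passing the module positionally is rejected, which prevents swapping the
# (module, name) order used by importlib.resources with (name, module).
try:
    load_csv_data("iris.csv", "demo_kw_data")
    positional_allowed = True
except TypeError:
    positional_allowed = False
```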

gzip_encoding = kwargs["encoding"]

compressed_file = gzip.open(compressed_file, mode="rt", encoding=gzip_encoding)
data = np.loadtxt(compressed_file, **kwargs)
Member

I don't understand why we would pass the encoding kwarg to np.loadtxt if it has already been used by gzip.open to decode the text.

Either call gzip.open with mode="rb" and only pass encoding to np.loadtxt so that it handles the text decoding correctly itself, or pass it only to gzip.open with mode="rt" (in which case it would make sense to have load_gzip_compressed_csv_data(data_module, data_file_name, encoding="utf-8", **kwargs) to simplify the code to only forward kwargs to np.loadtxt).

Contributor Author

Good point. Although encoding is passed around a bunch of times in np.loadtxt, it's not really used when the file is already opened in text mode.

Going with the latter option.

Comment on lines 548 to 549
data_file_name = "iris.csv"
data, target, target_names = load_csv_data(DATA_MODULE, data_file_name)
Member

I think we can avoid the extra indirection in a local variable that is used only once.

Suggested change
data_file_name = "iris.csv"
data, target, target_names = load_csv_data(DATA_MODULE, data_file_name)
data, target, target_names = load_csv_data(DATA_MODULE, "iris.csv")

Contributor Author

@jackzyliu jackzyliu Jun 30, 2021

data_file_name is actually used again in constructing the Bunch below, which is why I used a variable. Do you still want me to make the change?

Edit: I think the same goes for breast_cancer.csv below.

Comment on lines 663 to 664
data_file_name = "breast_cancer.csv"
data, target, target_names = load_csv_data(DATA_MODULE, data_file_name)
Member

Suggested change
data_file_name = "breast_cancer.csv"
data, target, target_names = load_csv_data(DATA_MODULE, data_file_name)
data, target, target_names = load_csv_data(DATA_MODULE, "breast_cancer.csv")

assert len(bunch.target_names) == n_target
if has_descr:
assert bunch.DESCR
if filenames:
Member

If we expect a non-empty list of filenames, I think we should require the data module to be present in the bunch.

assert all(
[
f in bunch
and "file_module" in bunch
Member

I would rather use the data_module field in all those Bunch instances:

Suggested change
and "file_module" in bunch
and "data_module" in bunch

) as modified_gzip:
original_data_module = OPENML_TEST_DATA_MODULE + "." + f"id_{data_id}"
original_data_file_name = "data-v1-dl-1666876.arff.gz"
corrupt_copy_path = os.path.join(tmpdir, "test_invalid_checksum.arff")
Member

nitpick: I am pretty sure this can be simplified as:

Suggested change
corrupt_copy_path = os.path.join(tmpdir, "test_invalid_checksum.arff")
corrupt_copy_path = tmpdir / "test_invalid_checksum.arff"
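A brief sketch of the equivalence behind this suggestion, assuming `tmpdir` is a path object that supports the `/` operator. A `pathlib.Path` is used here as a stand-in, since the actual type of `tmpdir` in the test is not shown.

```python
import os
import tempfile
from pathlib import Path

# Stand-in for the test's tmpdir; its real type is an assumption here.
tmpdir = Path(tempfile.mkdtemp())

joined_with_os = os.path.join(tmpdir, "test_invalid_checksum.arff")
joined_with_slash = tmpdir / "test_invalid_checksum.arff"

# Both spell the same location; os.path.join returns a str while the
# "/" operator keeps the result as a path object.
same_location = Path(joined_with_os) == joined_with_slash
```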

import scipy.sparse as sp
import os
import shutil
from importlib import resources as importlib_resources
Member

I think we could just use:

Suggested change
from importlib import resources as importlib_resources
from importlib import resources

and then call resources.open_binary / resources.open_text instead of renaming this module everywhere.

@jackzyliu
Contributor Author

jackzyliu commented Jun 30, 2021

Hi @ogrisel, just updated according to your comments:

  • change signatures of underlying load functions to reduce various redundancies
  • allow load_descr within load_csv_data and load_gzip_compressed_csv_data
  • directly use from importlib import resources instead of renaming the module
  • rename module field in returning Bunch from file_module to data_module
  • style improvement (replace os.path.join with a /)

Build failure: codecov/patch shows failed, but the marked statements are all either 1) actually tested in 9cd899a (descr checks) or 2) themselves test statements.

Please let me know if you have further comments. Thanks!

@glemaitre
Member

Regarding the failure, I don't think this is a big deal. We need to be sure to run the tests locally (they are passing on my side), but they will be skipped anyway on the CIs.

Member

@glemaitre glemaitre left a comment

LGTM. These are only suggestions for the documentation, so that it follows numpydoc style as much as possible.

data_file_name : string
Name of csv file to be loaded from
module_path/data/data_file_name. For example 'wine_data.csv'.
data_module/data_file_name. For example 'wine_data.csv'.
Member

Suggested change
data_module/data_file_name. For example 'wine_data.csv'.
`data_module/data_file_name`. For example `'wine_data.csv'`.

Comment on lines 261 to 263
data_module : string or module, optional
module where data lives;
default "sklearn.datasets.data" (DATA_MODULE).
Member

Suggested change
data_module : string or module, optional
module where data lives;
default "sklearn.datasets.data" (DATA_MODULE).
data_module : str or module, default="sklearn.datasets.data"
Module where data lives. The default is `"sklearn.datasets.data"`.

module where data lives;
default "sklearn.datasets.data" (DATA_MODULE).
descr_file_name : string, optional
Member

Suggested change
descr_file_name : string, optional
descr_file_name : str, default=None

Comment on lines 266 to 268
(See `load_descr`) Name of rst file to be loaded from
descr_module/descr_file_name. For example 'wine_data.rst';
If not None, also returns the corresponding description of
the dataset.
Member

Suggested change
(See `load_descr`) Name of rst file to be loaded from
descr_module/descr_file_name. For example 'wine_data.rst';
If not None, also returns the corresponding description of
the dataset.
Name of rst file to be loaded from `descr_module/descr_file_name`.
For example 'wine_data.rst'. See also :func:`load_descr`.
If not None, also returns the corresponding description of
the dataset.

If not None, also returns the corresponding description of
the dataset.
descr_module : string or module, optional
Member

Suggested change
descr_module : string or module, optional
descr_module : str or module, default="sklearn.datasets.descr"

Comment on lines 363 to 365
descr : string, optional
Description of the dataset (content of descr_file_name). Only returned
if descr_file_name is not None.
Member

Can you format it as in the previous function?

Comment on lines 384 to 390
descr_file_name : string
Name of rst file to be loaded from
descr_module/descr_file_name. For example 'wine_data.rst'.
descr_module : string or module, optional
module where descr_file_name lives;
default "sklearn.datasets.descr" (DESCR_MODULE).
Member

Same here: could you follow the docstring format of the previous function as well?



def load_descr(descr_file_name, *, descr_module=DESCR_MODULE):
"""Loads descr_file_name from descr_module with importlib.resources.
Member

Suggested change
"""Loads descr_file_name from descr_module with importlib.resources.
"""Load `descr_file_name` from `descr_module` with `importlib.resources`.

Returns
-------
string; content of descr_file_name
Member

Suggested change
string; content of descr_file_name
fdescr : str
Content of `descr_file_name`.
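Putting the docstring suggestions together, a numpydoc-style `load_descr` might read as in the sketch below. The wording is assembled from the review comments above rather than copied from the merged code, and `demo_descr` is a hypothetical package standing in for `sklearn.datasets.descr`.

```python
import importlib
import sys
import tempfile
from importlib import resources
from pathlib import Path

# Throwaway package standing in for sklearn.datasets.descr.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_descr"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "wine_data.rst").write_text("Wine recognition dataset\n")
sys.path.insert(0, str(root))
importlib.invalidate_caches()

DESCR_MODULE = "demo_descr"  # hypothetical stand-in


def load_descr(descr_file_name, *, descr_module=DESCR_MODULE):
    """Load `descr_file_name` from `descr_module` with `importlib.resources`.

    Parameters
    ----------
    descr_file_name : str
        Name of rst file to be loaded from `descr_module/descr_file_name`.
        For example `'wine_data.rst'`.

    descr_module : str or module, default="demo_descr"
        Module where `descr_file_name` lives.

    Returns
    -------
    fdescr : str
        Content of `descr_file_name`.
    """
    return resources.read_text(descr_module, descr_file_name)


fdescr = load_descr("wine_data.rst")
```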

Comment on lines 1236 to 1237
filenames = list()
images = list()
Member

Suggested change
filenames = list()
images = list()
filenames, images = [], []

@jackzyliu
Contributor Author

jackzyliu commented Jul 22, 2021

@glemaitre Thanks a lot for the review and suggestions! I have modified the documentation accordingly. Let me know if anything else needs changing.

Weirdly, the doc checks both seem to have failed at build_doc.sh, specifically at pip install "$(get_dep sphinxext-opengraph $SPHINXEXT_OPENGRAPH_VERSION)", where $SPHINXEXT_OPENGRAPH_VERSION didn't pick up the value specified in config.yml. I wonder what could be the issue here.

I will wait for further comments before I push to force-run checks again.

@glemaitre
Member

I think that you only need to merge main into your branch and it should be fine.

- change signatures of underlying load functions to reduce various redundancies
- allow `load_descr` within `load_csv_data` and `load_gzip_compressed_csv_data`
- undo renaming of `importlib.resources` in import
- rename module field in returning `Bunch`
- other style improvements
@jackzyliu
Contributor Author

I think that you only need to merge main into your branch and it should be fine.

Thanks! All green now.

Member

@rth rth left a comment

Thanks @jackzyliu ! LGTM as well. Merging with two approvals above.

@rth rth merged commit 5562cc5 into scikit-learn:main Jul 27, 2021
Contributor

@nithish08 nithish08 left a comment

LGTM

TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Jul 29, 2021
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
Development

Successfully merging this pull request may close these issues.

__file__ should be avoided in library code

6 participants