DOC add warning regarding the load_boston function #20729

Merged
merged 25 commits into from
Aug 17, 2021

Conversation

glemaitre
Member

@glemaitre glemaitre commented Aug 10, 2021

closes #16155
closes #18594

Add permanent warning in the documentation.

I think that we still need to settle on the following because it was not super clear from the last message in #18594 (at least to my understanding).

TODO:

  • Add a UserWarning when using the dataset?
  • Deprecate the dataset, pointing to the alternative datasets (California Housing and Ames Housing) and also to the original version on OpenML, which is encouraged for studying ethics?

@glemaitre
Member Author

ping @adrinjalali @ogrisel could you explicitly mention which of the TODO items we should actually do (it could be both)?

Member

@ogrisel ogrisel left a comment

Thanks for taking care of finalizing this.

I think I would be in favor of raising a warning by default, with a way to silence it, but not of deprecating this function, because it is used by so many code snippets, especially in tutorials, ML coursework, books and so on. Books in particular are problematic because they cannot easily be reprinted. Having the permanent warning by default would be a way to spread the message to all the readers of those resources.

To silence the warning, we could have something explicit such as:

>>> X, y = load_boston(acknowledge_ethical_problem=True)
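A minimal sketch of how such an opt-out flag could behave, assuming a plain `UserWarning` and a stand-in loader. The function name `load_dataset_sketch`, the warning text, and the placeholder return value are hypothetical; only the keyword `acknowledge_ethical_problem` comes from the suggestion above.

```python
import warnings


def load_dataset_sketch(acknowledge_ethical_problem=False):
    """Hypothetical loader that warns unless the caller opts out explicitly."""
    if not acknowledge_ethical_problem:
        warnings.warn(
            "This dataset has an ethical problem; see the documentation.",
            UserWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
    # Placeholder (X, y) pair, not real data.
    return [[0.0, 1.0]], [24.0]
```

Passing `acknowledge_ethical_problem=True` skips the warning, so existing tutorials keep working while readers still see the message by default.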

@ogrisel
Member

ogrisel commented Aug 11, 2021

Re-reading the discussion at #18594 (comment), here is an alternative suggestion:

  • in the short term: raise a FutureWarning that states the problem, proposes alternatives with copy-pastable lines (including the alternative to fetch the Boston dataset from OpenML), and warns that in version 1.2 this function will stop returning the dataset and instead raise an Exception
  • in the long term: keep the function but raise an Exception with the statements and the copy-pastable alternatives.

This way, readers of old ML books will always get an informative error message instead of an ImportError.
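The two phases described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: the function name `load_boston_sketch`, the `removed` switch (standing in for the future release), the choice of `RuntimeError`, and the message wording are all hypothetical.

```python
import warnings

# Hypothetical message; the real wording was still under discussion here.
_MESSAGE = (
    "load_boston is deprecated because of an ethical problem with the dataset. "
    "Consider an alternative such as fetch_california_housing, or fetch the "
    "original data from OpenML."
)


def load_boston_sketch(removed=False):
    """Stand-in loader; `removed` plays the role of the future release."""
    if removed:
        # Long term: the name still exists but fails with guidance.
        raise RuntimeError(_MESSAGE)
    # Short term: still return data, but emit a FutureWarning.
    warnings.warn(_MESSAGE, FutureWarning, stacklevel=2)
    # Placeholder (X, y) pair, not real data.
    return [[0.0, 1.0]], [24.0]
```

Because the same message is used in both phases, a reader following an old book sees identical guidance whether they hit the warning or the later exception.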

@ogrisel
Member

ogrisel commented Aug 11, 2021

Alternative to the alternative: instead of keeping the function in 1.2 and having it raise an exception, we could implement a module-level __getattr__ function in sklearn.datasets. It would serve a similar purpose but raise the exception at import time instead of call time. See: https://www.python.org/dev/peps/pep-0562/
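A minimal sketch of the PEP 562 mechanism, using a throwaway in-memory module so it runs without touching scikit-learn. The module name `fake_datasets`, the helper `_datasets_getattr`, and the message text are hypothetical; in a real package the `__getattr__` function would simply be defined at the top level of the module file.

```python
import sys
import types


def _datasets_getattr(name):
    """Module-level __getattr__: called only when normal lookup fails."""
    if name == "load_boston":
        raise ImportError(
            "load_boston has been removed; see the documentation for "
            "alternative datasets."
        )
    raise AttributeError(f"module 'fake_datasets' has no attribute {name!r}")


# Build a stand-in module and register it so `from ... import` can find it.
fake_datasets = types.ModuleType("fake_datasets")
fake_datasets.__getattr__ = _datasets_getattr
fake_datasets.load_iris = lambda: "iris"  # an ordinary, still-available name
sys.modules["fake_datasets"] = fake_datasets

message = None
try:
    from fake_datasets import load_boston  # fails at import time
except ImportError as exc:
    message = str(exc)
```

The error surfaces on the `from ... import` line itself, which is the "import time instead of call time" behavior mentioned above, while other attributes of the module are unaffected.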

@glemaitre
Member Author

I made the deprecation and added code snippets for the alternative datasets. I also added a TODO note for the change to make in 1.2.

In this special case, you can fetch the dataset from OpenML:

from sklearn.datasets import fetch_openml
boston = fetch_openml('boston', version=2, as_frame=False)
Member

Unfortunately this is not a drop-in replacement. The .data matrix is the same but the .target attribute holds values in {'N', 'P'} instead of the price information.

Member Author

Let me check whether there is an alternative on OpenML.

Member Author

We need to use version=1

Member Author

but np.testing.assert_allclose says it is not the same. Let me check if this is just the column order

Member

Otherwise it should be possible to use pandas.read_csv with clever options to directly read the original:

http://lib.stat.cmu.edu/datasets/boston

Member

@ogrisel ogrisel Aug 11, 2021

Got it! The following lines make it possible to reconstruct the same arrays as in scikit-learn from the original text file (I checked):

import pandas as pd
import numpy as np


data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
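The slicing above implies that each record in the text file spans two lines: 11 values on the first line, then 2 more features and the target on the second. A dependency-free sketch of the same de-interleaving, run on a tiny synthetic sample (the numbers below are made up and are not rows from the real file):

```python
# Synthetic stand-in for the file layout: two lines per record.
sample = """\
1 2 3 4 5 6 7 8 9 10 11
12 13 24.0
21 22 23 24 25 26 27 28 29 30 31
32 33 21.6
"""

# Parse every line into a list of floats.
rows = [[float(tok) for tok in line.split()] for line in sample.splitlines()]

# Even-indexed lines hold the first 11 features, odd-indexed lines the rest.
first_lines, second_lines = rows[::2], rows[1::2]

# Stitch each pair back together: 11 + 2 features, then the target.
data = [head + tail[:2] for head, tail in zip(first_lines, second_lines)]
target = [tail[2] for tail in second_lines]
```

This mirrors `raw_df.values[::2, :]`, `raw_df.values[1::2, :2]` and `raw_df.values[1::2, 2]` in the pandas version, which additionally relies on `read_csv` padding the shorter second lines with NaN.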

Member Author

I think that we can keep the fetch_openml indeed.

Member

@ogrisel ogrisel Aug 11, 2021

I would rather include the code snippet above instead of relying on fetch_openml:

  • this is using the original source dataset
  • it is a direct drop-in replacement for the numpy arrays returned by load_boston.
  • (it's faster than ARFF parsing ;)

As you show in #20729 (review), it's not easy to get the drop-in replacement from fetch_openml because of the presence of ordinal encoded categories with a non-zero offset.

Member

@ogrisel ogrisel left a comment

First pass of comments:

glemaitre and others added 2 commits August 11, 2021 15:41
@ogrisel
Member

ogrisel commented Aug 11, 2021

For information, from the website statistics of the past 12 months:

image

Just to make the point that it might make sense to have a longer deprecation strategy than usual for this particular function removal.

for name, cats in boston_openml.categories.items():
    cat_idx = boston_openml.feature_names.index(name)
    cats = np.asarray(cats, dtype=np.float64)
    data_openml[:, cat_idx] = cats[data_openml[:, cat_idx].astype(np.int64)]
Member Author

I was not expecting the gymnastics with the categories here.
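The remapping in the snippet above can be illustrated without NumPy or OpenML. In this sketch the feature names, category lists, and data values are hypothetical; the point is only that a column stores ordinal codes (positions in the category list), so each code has to be looked up to recover the original numeric value.

```python
# Hypothetical metadata: one categorical column whose categories are
# numeric strings and do not start at 0.
feature_names = ["CRIM", "RAD"]
categories = {"RAD": ["1", "2", "3", "24"]}

# Hypothetical data: the RAD column holds codes 0..3, not the values.
data = [[0.1, 0.0], [0.2, 3.0], [0.3, 1.0]]

for name, cats in categories.items():
    col = feature_names.index(name)
    values = [float(c) for c in cats]  # category labels as numbers
    for row in data:
        # Replace the ordinal code with the value it stands for.
        row[col] = values[int(row[col])]
```

After the loop the second column holds 1.0, 24.0 and 2.0, which shows why comparing the encoded column directly against the original values fails.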

@ogrisel ogrisel changed the title DOC add warning regarding the load_boston DOC add warning regarding the load_boston function Aug 11, 2021
Member

@ogrisel ogrisel left a comment

Overall LGTM once #20729 (comment) is addressed.

Member

@adrinjalali adrinjalali left a comment

I'm happy with either the existing approach or @ogrisel's suggestions (ref #20729 (comment)).

- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
3 participants