DOC add warning regarding the load_boston function #20729
Conversation
ping @adrinjalali @ogrisel could you explicitly mention which one of the TODO items to actually do (it could be both, indeed)?
Thanks for taking care of finalizing this.
I think I would be in favor of raising a warning by default with a way to silence it, but not deprecating this function, because it is used in so many code snippets, especially in tutorials, ML coursework, books and so on. Books in particular are problematic because they cannot easily be reprinted. Keeping the permanent warning by default would be a way to spread the message to all the readers of those resources.
To silence the warning, we could have something explicit such as:

```python
>>> X, y = load_boston(acknowledge_ethical_problem=True)
```
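A minimal sketch of how such an opt-in flag could work (the function body and warning message below are hypothetical placeholders, not the actual scikit-learn implementation):

```python
import warnings

def load_boston(acknowledge_ethical_problem=False):
    """Hypothetical sketch: warn by default, silenced only by an explicit opt-in."""
    if not acknowledge_ethical_problem:
        warnings.warn(
            "The Boston housing prices dataset has an ethical problem; "
            "see the scikit-learn documentation for details and alternatives.",
            UserWarning,
        )
    # ... the actual dataset loading would happen here ...
    return None, None
```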
Re-reading the discussion at #18594 (comment), here is an alternative suggestion:
This way, readers of old ML books will always get an informative error message instead of an
Alternative to the alternative: instead of keeping the function in 1.2 and having it raise an exception, we could implement a module-level
I made the deprecation and added code snippets pointing to alternative datasets. I added a
sklearn/datasets/_base.py (outdated)

In this special case, you can fetch the dataset from OpenML:

```python
from sklearn.datasets import fetch_openml

boston = fetch_openml('boston', version=2, as_frame=False)
```
Unfortunately this is not a drop-in replacement. The `.data` matrix is the same, but the `.target` attribute holds values in `{'N', 'P'}` instead of the price information.
Let me check if there is some alternative on OpenML.
We need to use `version=1`.
But `np.testing.assert_allclose` says it is not the same. Let me check if this is just the column order.
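One way to rule out a column-order mismatch is to reorder the OpenML columns to follow `load_boston`'s `feature_names` before comparing. A small sketch with a made-up helper name and toy data:

```python
import numpy as np

def align_columns(data, names, reference_names):
    """Reorder the columns of `data` (labelled by `names`) to follow
    `reference_names`, so two arrays can be compared column-by-column."""
    order = [names.index(n) for n in reference_names]
    return data[:, order]

# Toy example: columns stored as (B, A), reference order is (A, B).
toy = np.array([[1.0, 2.0], [3.0, 4.0]])
aligned = align_columns(toy, ["B", "A"], ["A", "B"])
```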
Otherwise it should be possible to use `pandas.read_csv` with clever options to directly read the original:
Got it! The following lines make it possible to reconstruct the same arrays as in scikit-learn from the original text file (I checked):

```python
import pandas as pd
import numpy as np

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
```
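The even/odd row stitching works because the original file stores each record across two physical lines. A self-contained toy illustration of the same indexing (the shapes here are made up; the real file has 11 + 3 values per record):

```python
import numpy as np

# Two records, each split across two rows: 3 values, then 2 features + target.
raw = np.array([
    [1.0, 2.0, 3.0],     # record 1, first part
    [4.0, 5.0, 9.0],     # record 1, second part (last value is the target)
    [6.0, 7.0, 8.0],     # record 2, first part
    [10.0, 11.0, 21.0],  # record 2, second part
])
# Even rows give the first chunk of features, odd rows the rest plus the target.
data = np.hstack([raw[::2, :], raw[1::2, :2]])
target = raw[1::2, 2]
```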
I think that we can keep the `fetch_openml` indeed.
I would rather include the code snippet above instead of relying on `fetch_openml`:

- this is using the original source dataset;
- it is a direct drop-in replacement for the numpy arrays returned by `load_boston`;
- it's faster than ARFF parsing ;)
As you show in #20729 (review), it's not easy to get the drop-in replacement from `fetch_openml` because of the presence of ordinal encoded categories with a non-zero offset.
First pass of comments:
Co-authored-by: Olivier Grisel <[email protected]>
sklearn/datasets/tests/test_base.py (outdated)

```python
for name, cats in boston_openml.categories.items():
    cat_idx = boston_openml.feature_names.index(name)
    cats = np.asarray(cats, dtype=np.float64)
    data_openml[:, cat_idx] = cats[data_openml[:, cat_idx].astype(np.int64)]
```
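The decoding above is needed because the parsed data holds ordinal codes that index into the per-column category values, and those values do not start at zero. A toy illustration of the same code-to-value mapping (the numbers are made up):

```python
import numpy as np

# Category values as parsed from the ARFF metadata (note the non-zero offset).
cats = np.asarray(["12.6", "25.0", "87.3"], dtype=np.float64)
# Ordinal codes as returned by the parser for three samples.
codes = np.array([2.0, 0.0, 1.0])
# Map each code back to its actual category value via integer indexing.
decoded = cats[codes.astype(np.int64)]
```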
I was not expecting the gymnastics with the `categories` here.
Overall LGTM once #20729 (comment) is addressed.
I'm happy with either the existing approach or @ogrisel's suggestion (ref #20729 (comment)).
Co-authored-by: Olivier Grisel <[email protected]>
closes #16155
closes #18594
Add permanent warning in the documentation.
I think that we still need to settle on the following because it was not super clear from the last message in #18594 (at least to my understanding).
TODO:

- `UserWarning` when using the dataset?