DOC add warning regarding the load_boston function #20729
Conversation
ping @adrinjalali @ogrisel could you explicitly mention which one of the TODO items to actually do (it could be both, indeed)?
Thanks for taking care of finalizing this.
I think I would be in favor of raising a warning by default with a way to silence it, but not deprecating this function, because it is used in so many code snippets, especially in tutorials, ML coursework, books and so on. Books in particular are problematic because they cannot easily be reprinted. Keeping the permanent warning by default would be a way to spread the message to all the readers of those resources.
To silence the warning, we could have something explicit such as:

```python
>>> X, y = load_boston(acknowledge_ethical_problem=True)
```
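A minimal sketch of how such an opt-in flag could work (the function body and warning message below are hypothetical placeholders, not the actual scikit-learn implementation):

```python
import warnings

def load_boston(acknowledge_ethical_problem=False):
    """Hypothetical sketch: warn by default, silenced only by an explicit opt-in."""
    if not acknowledge_ethical_problem:
        warnings.warn(
            "The Boston housing prices dataset has an ethical problem; "
            "see the scikit-learn documentation for details and alternatives.",
            UserWarning,
        )
    # ... the actual dataset loading would happen here ...
    return None, None
```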
Re-reading the discussion at #18594 (comment), here is an alternative suggestion:
This way, readers of old ML books will always get an informative error message instead of an
Alternative to the alternative: instead of keeping the function in 1.2 and having it raise an exception, we could implement a module-level
I made the deprecation and added code snippets pointing to alternative datasets. I added a
sklearn/datasets/_base.py (outdated)

In this special case, you can fetch the dataset from OpenML:

```python
from sklearn.datasets import fetch_openml

boston = fetch_openml('boston', version=2, as_frame=False)
```
Unfortunately this is not a drop-in replacement. The `.data` matrix is the same, but the `.target` attribute holds values in `{'N', 'P'}` instead of the price information.
Let me check if there is some alternative on OpenML.
We need to use `version=1`.
But `np.testing.assert_allclose` says it is not the same. Let me check if this is just the column order.
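One way to rule out a column-order mismatch is to reorder the OpenML columns to follow `load_boston`'s `feature_names` before comparing. A small sketch with a made-up helper name and toy data:

```python
import numpy as np

def align_columns(data, names, reference_names):
    """Reorder the columns of `data` (labelled by `names`) to follow
    `reference_names`, so two arrays can be compared column-by-column."""
    order = [names.index(n) for n in reference_names]
    return data[:, order]

# Toy example: columns stored as (B, A), reference order is (A, B).
toy = np.array([[1.0, 2.0], [3.0, 4.0]])
aligned = align_columns(toy, ["B", "A"], ["A", "B"])
```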
Otherwise it should be possible to use `pandas.read_csv` with clever options to directly read the original:
Got it! The following lines make it possible to reconstruct the same arrays as in scikit-learn from the original text file (I checked):

```python
import pandas as pd
import numpy as np

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
```
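The even/odd row stitching works because the original file stores each record across two physical lines. A self-contained toy illustration of the same indexing (the shapes here are made up; the real file has 11 + 3 values per record):

```python
import numpy as np

# Two records, each split across two rows: 3 values, then 2 features + target.
raw = np.array([
    [1.0, 2.0, 3.0],     # record 1, first part
    [4.0, 5.0, 9.0],     # record 1, second part (last value is the target)
    [6.0, 7.0, 8.0],     # record 2, first part
    [10.0, 11.0, 21.0],  # record 2, second part
])
# Even rows give the first chunk of features, odd rows the rest plus the target.
data = np.hstack([raw[::2, :], raw[1::2, :2]])
target = raw[1::2, 2]
```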
I think that we can keep the `fetch_openml` indeed.
I would rather include the code snippet above instead of relying on `fetch_openml`:

- this is using the original source dataset;
- it is a direct drop-in replacement for the numpy arrays returned by `load_boston`;
- it's faster than ARFF parsing ;)
As you show in #20729 (review), it's not easy to get the drop-in replacement from `fetch_openml` because of the presence of ordinal encoded categories with a non-zero offset.
First pass of comments:
Co-authored-by: Olivier Grisel <[email protected]>
sklearn/datasets/tests/test_base.py (outdated)

```python
for name, cats in boston_openml.categories.items():
    cat_idx = boston_openml.feature_names.index(name)
    cats = np.asarray(cats, dtype=np.float64)
    data_openml[:, cat_idx] = cats[data_openml[:, cat_idx].astype(np.int64)]
```
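The decoding above is needed because the parsed data holds ordinal codes that index into the per-column category values, and those values do not start at zero. A toy illustration of the same code-to-value mapping (the numbers are made up):

```python
import numpy as np

# Category values as parsed from the ARFF metadata (note the non-zero offset).
cats = np.asarray(["12.6", "25.0", "87.3"], dtype=np.float64)
# Ordinal codes as returned by the parser for three samples.
codes = np.array([2.0, 0.0, 1.0])
# Map each code back to its actual category value via integer indexing.
decoded = cats[codes.astype(np.int64)]
```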
I was not expecting the gymnastics with the `categories` here.
Overall LGTM once #20729 (comment) is addressed.
I'm happy with either the existing approach or @ogrisel's suggestion (ref #20729 (comment)).
Co-authored-by: Olivier Grisel <[email protected]>
closes #16155
closes #18594
Add permanent warning in the documentation.
I think that we still need to settle on the following because it was not super clear from the last message in #18594 (at least to my understanding).
TODO:

- `UserWarning` when using the dataset?