
[MRG] removed boston dataset #18594


Closed · wants to merge 1 commit into from

Conversation

amy12xx
Contributor

@amy12xx commented Oct 10, 2020

Reference Issues/PRs

Towards #16155. Closes #17351.

What does this implement/fix? Explain your changes.

Removes all the references to the boston dataset from the docs.

Any other comments?

Worked on by @ezebunandu and @amy12xx.

@amy12xx
Contributor Author

amy12xx commented Oct 10, 2020

Does this need a what's new entry?

amy12xx marked this pull request as ready for review October 10, 2020 17:54
amy12xx changed the title [DRAFT] removed boston dataset [MRG] removed boston dataset Oct 10, 2020
@reshamas
Member

@amy12xx @ezebunandu
I am glad to see you both here again, returning after the June 2020 Data Umbrella sprint. This is great!

@thomasjpfan
What would be good label(s) for this PR?

I did a search of the scikit-learn repo, and saw that the boston data showed up in a bunch of other files:
https://github.com/scikit-learn/scikit-learn/search?q=boston

@amy12xx
Contributor Author

amy12xx commented Oct 11, 2020

@reshamas Yes, the pair programming was a great idea; we decided to continue with it.

@ogrisel
Member

ogrisel commented Oct 13, 2020

I labeled this as doc. Before merging this, shall we officially deprecate the load_boston function? I don't want to deprecate it too abruptly because it's used in a lot of documentation (including tutorials on fairness issues in Machine Learning).

If we decide not to deprecate, we should keep the link to the API doc but we should expand the docstring to explain why this dataset is problematic.

@ogrisel
Member

ogrisel commented Oct 13, 2020

I think I would be in favor of not deprecating but making it clear in the docstring that this dataset has an ethical problem. Let me try to suggest the following paragraph as a draft we can collectively refine to reach a consensus:

"""
Warning: the Boston housing prices dataset has an ethical problem: as investigated in [ref1], the authors of this dataset engineered a non-invertible variable "B" assuming that racial self-segregation had a positive impact on house prices [ref2]. Furthermore, the goal of the research that led to the creation of this dataset was to study the impact of air quality, but it did not give an adequate demonstration of the validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of this dataset unless the purpose of the code is to study and educate about ethical issues in data science and machine learning.

[ref1] https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8
[ref2] https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air
"""

@ogrisel
Member

ogrisel commented Oct 13, 2020

As @jni suggested on the mailing list, we can also issue a user warning with the content of the above paragraph when calling the function with the default parameters (and maybe also add a non-default option to silence it), in addition to adding the paragraph to the docstring.
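A minimal sketch of what that runtime warning could look like, assuming a hypothetical `warn_ethics` keyword to silence it (the parameter name and message are illustrative only, not an agreed API):

```python
import warnings

_ETHICS_MSG = (
    "The Boston housing prices dataset has an ethical problem; see the "
    "docstring for details and references. Its use is strongly discouraged "
    "unless the goal is to study and educate about these issues."
)


def load_boston(*, return_X_y=False, warn_ethics=True):
    """Illustrative wrapper that surfaces the warning at call time."""
    if warn_ethics:
        # Emit a UserWarning by default; callers can opt out explicitly
        # via the hypothetical warn_ethics=False flag.
        warnings.warn(_ETHICS_MSG, UserWarning)
    # ... delegate to the actual dataset loading code ...
```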

@NicolasHug
Member

I like the warning in the docs.

I think we need to keep a reference to the function at least in the API ref: when a function isn't referenced anywhere, Sphinx does not render it at all and it becomes impossible to find. But IMHO, if we have a doc warning both in the docstring and in the "toy datasets" section, we don't need to remove any refs. Maybe we can just move them to the bottom.

Regarding a user warning in the code: no strong opinion. I just think we shouldn't raise a warning unless we plan to remove the loader eventually, or unless we allow disabling the warning as suggested.

@sherbold

@ogrisel's suggestion sounds perfect to me. I also use the dataset in education and specifically discuss the ethical issues in lectures and exercises. Ethical problems with data and AI do not go away when we hide them; they only go away when we teach people to spot and avoid them. Adding a comment with references to the docstring sounds like a perfect solution for this.

@ogrisel
Member

ogrisel commented Oct 14, 2020

Consolidating @adrinjalali's reply from the mailing list here, for people not registered on the mailing list:

Let's talk about the alternatives we have:

Keep the loader, but raise a warning:

  • This will result in most people not changing their code/material and, IMO, mostly ignoring the warning. Some people may see the warning and care about it.

Deprecate, and point them to an alternative dataset, and if they really really want the same dataset, point them to the openml ID:

  • People will have to change something, and if we give them a nice copy/paste-able alternative which is not boston, they'll use that instead.
  • Some people will keep using boston from openml, and not care about the ethical implications.

As an addition, we can keep load_boston in the docs only, and point users to alternatives even after removing the loader.
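To illustrate the second option (deprecate and point people to the OpenML copy), here is a rough sketch; the OpenML name/version used for the Boston data and the wording of the message are assumptions, not a final implementation:

```python
import warnings

from sklearn.datasets import fetch_openml


def load_boston(*args, **kwargs):
    # Hypothetical deprecation shim: warn, then point users to the OpenML
    # copy and to alternative datasets instead of shipping the data.
    warnings.warn(
        "load_boston is deprecated. If you really need this dataset, fetch "
        "it from OpenML, e.g. fetch_openml(name='boston', version=1); "
        "otherwise consider fetch_california_housing as an alternative.",
        FutureWarning,
    )
    return fetch_openml(name="boston", version=1, as_frame=True)
```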

@ogrisel
Member

ogrisel commented Mar 8, 2021

People will have to change something, and if we give them a nice copy/paste-able alternative which is not boston,
they'll use that instead.
Some people will keep using boston from openml, and not care about the ethical implications

But if they do that, we should also point to alternative related datasets that do not have the "segregationist variable problem", for instance fetch_california_housing and the Ames housing dataset on OpenML.
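For reference, a short sketch of how those alternatives can be loaded; the OpenML name `house_prices` for the Ames housing data is an assumption (check openml.org for the exact dataset name/ID):

```python
from sklearn.datasets import fetch_california_housing, fetch_openml

# California housing ships with scikit-learn and has no "B" variable.
california = fetch_california_housing(as_frame=True)
print(california.frame.head())

# Ames housing, fetched from OpenML (dataset name assumed here).
ames = fetch_openml(name="house_prices", as_frame=True)
print(ames.frame.head())
```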

@adrinjalali
Member

But if they do that, we should also point to alternative related datasets that do not have the "segregationist variable problem", for instance fetch_california_housing and the Ames housing dataset on OpenML.

Sounds good to me. There are still legitimate use cases for a problematic dataset, e.g. discussing its problematic aspects :D, but overall I'm happy with the combination of these solutions.
