
[MRG] removed boston dataset #18594


Closed · wants to merge 1 commit into from

Conversation

amy12xx
Contributor

@amy12xx commented Oct 10, 2020

Reference Issues/PRs

Towards #16155. Closes #17351.

What does this implement/fix? Explain your changes.

Removes all the references to the boston dataset from the docs.

Any other comments?

Worked on by @ezebunandu and @amy12xx.

@amy12xx
Contributor Author

amy12xx commented Oct 10, 2020

Does this need a what's new entry?

amy12xx marked this pull request as ready for review October 10, 2020 17:54
amy12xx changed the title [DRAFT] removed boston dataset [MRG] removed boston dataset Oct 10, 2020
@reshamas
Member

@amy12xx @ezebunandu
I am glad to see you both here again, returning after the June 2020 Data Umbrella sprint. This is great!

@thomasjpfan
What would be good label(s) for this PR?

I did a search of the scikit-learn repo, and saw that the boston data showed up in a bunch of other files:
https://github.com/scikit-learn/scikit-learn/search?q=boston

@amy12xx
Contributor Author

amy12xx commented Oct 11, 2020

@reshamas Yes, the pair programming was a great idea; we decided to continue with it.

@ogrisel
Member

ogrisel commented Oct 13, 2020

I labeled this as doc. Before merging this, shall we officially deprecate the load_boston function? I don't want to deprecate it too abruptly because it's used in a lot of documentation (including tutorials on fairness issues in Machine Learning).

If we decide not to deprecate, we should keep the link to the API doc but we should expand the docstring to explain why this dataset is problematic.

@ogrisel
Member

ogrisel commented Oct 13, 2020

I think I would be in favor of not deprecating but making it clear in the docstring that this dataset has an ethical problem. Let me try to suggest the following paragraph as a draft we can collectively refine to reach a consensus:

"""
Warning: the Boston housing prices dataset has an ethical problem: as investigated in [ref1], the authors of this dataset engineered a non-invertible variable "B" assuming that racial self-segregation had a positive impact on house prices [ref2]. Furthermore, the goal of the research that led to the creation of this dataset was to study the impact of air quality, but it did not give an adequate demonstration of the validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of this dataset unless the purpose of the code is to study and educate about ethical issues in data science and machine learning.

[ref1] https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8
[ref2] https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air
"""

@ogrisel
Member

ogrisel commented Oct 13, 2020

As @jni suggested on the mailing list, we can also issue a user warning with the content of the above paragraph when calling the function with the default parameters (and maybe also add a non-default option to silence it), in addition to adding the paragraph to the docstring.
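A minimal sketch of what that runtime warning could look like, assuming a hypothetical `warn_ethics` keyword to silence it (the parameter name and message are illustrative only, not an agreed API):

```python
import warnings

_ETHICS_MSG = (
    "The Boston housing prices dataset has an ethical problem; see the "
    "docstring for details and references. Its use is strongly discouraged "
    "unless the goal is to study and educate about these issues."
)


def load_boston(*, return_X_y=False, warn_ethics=True):
    """Illustrative wrapper that surfaces the warning at call time."""
    if warn_ethics:
        # Emit a UserWarning by default; callers can opt out explicitly
        # via the hypothetical warn_ethics=False flag.
        warnings.warn(_ETHICS_MSG, UserWarning)
    # ... delegate to the actual dataset loading code ...
```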

@NicolasHug
Member

I like the warning in the docs.

I think we need to keep a reference to the function at least in the API ref: when a function isn't referenced anywhere, Sphinx does not render it at all and it becomes impossible to find. But IMHO, if we have a doc warning both in the docstring and in the "toy datasets" section, we don't need to remove any refs. Maybe we can just move them to the bottom.

Regarding a user warning in the code: no strong opinion. I just think we shouldn't raise a warning unless we plan to remove the loader eventually, or unless we allow disabling the warning as suggested.

@sherbold

@ogrisel's suggestion sounds perfect to me. I also use the dataset in education and specifically discuss the ethical issues in lectures and exercises. Ethical problems with data and AI do not go away when we hide them; they only go away when we teach people to spot and avoid them. Adding a comment with references to the docstring sounds like a perfect solution for this.

@ogrisel
Member

ogrisel commented Oct 14, 2020

Consolidating @adrinjalali's reply from the mailing list here, for people not registered on the mailing list:

Let's talk about the alternatives we have:

Keep the loader, but raise a warning:

  • This will result in most people not changing their code/material and, IMO, mostly ignoring the warning. Some people may see the warning and care about it.

Deprecate, and point them to an alternative dataset, and if they really really want the same dataset, point them to the openml ID:

  • People will have to change something, and if we give them a nice copy/paste-able alternative which is not boston, they'll use that instead.
  • Some people will keep using boston from openml, and not care about the ethical implications.

As an addition, we can keep load_boston in the docs only, and point users to alternatives even after removing the loader.
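To illustrate the second option (deprecate and point people to the OpenML copy), here is a rough sketch; the OpenML name/version used for the Boston data and the wording of the message are assumptions, not a final implementation:

```python
import warnings

from sklearn.datasets import fetch_openml


def load_boston(*args, **kwargs):
    # Hypothetical deprecation shim: warn, then point users to the OpenML
    # copy and to alternative datasets instead of shipping the data.
    warnings.warn(
        "load_boston is deprecated. If you really need this dataset, fetch "
        "it from OpenML, e.g. fetch_openml(name='boston', version=1); "
        "otherwise consider fetch_california_housing as an alternative.",
        FutureWarning,
    )
    return fetch_openml(name="boston", version=1, as_frame=True)
```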

@ogrisel
Member

ogrisel commented Mar 8, 2021

People will have to change something, and if we give them a nice copy/paste-able alternative which is not boston,
they'll use that instead.
Some people will keep using boston from openml, and not care about the ethical implications

But if they do that, we should also point to alternative related datasets that do not have the "segregationist variable problem", for instance fetch_california_housing and the Ames housing dataset on OpenML.
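For reference, a short sketch of how those alternatives can be loaded; the OpenML name `house_prices` for the Ames housing data is an assumption (check openml.org for the exact dataset name/ID):

```python
from sklearn.datasets import fetch_california_housing, fetch_openml

# California housing ships with scikit-learn and has no "B" variable.
california = fetch_california_housing(as_frame=True)
print(california.frame.head())

# Ames housing, fetched from OpenML (dataset name assumed here).
ames = fetch_openml(name="house_prices", as_frame=True)
print(ames.frame.head())
```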

@adrinjalali
Member

But if they do that, we should also point to alternative related datasets that do not have the "segregationist variable problem", for instance fetch_california_housing and the Ames housing dataset on OpenML.

Sounds good to me. There are still legitimate use cases for a problematic dataset, e.g. discussing its problematic aspects :D, but overall I'm happy with the combination of these solutions.
