[MRG] removed boston dataset #18594
Does this need a what's new entry?
@amy12xx @ezebunandu @thomasjpfan I did a search of the scikit-learn repo, and saw that the boston data showed up in a bunch of other files:
@reshamas yes, the pair programming was a great idea, we decided to continue with it.
I labeled this as doc. Before merging this, shall we officially deprecate the If we decide not to deprecate, we should keep the link to the API doc but we should expand the docstring to explain why this dataset is problematic.
I think I would be in favor of not deprecating but making it clear in the docstring that this dataset has an ethical problem. Let me try to suggest the following paragraph as a draft we can collectively refine to reach a consensus: """ The scikit-learn maintainers therefore strongly discourage the use of this dataset unless the purpose of the code is to study and educate about ethical issues in data science and machine learning. [ref1] https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8
As @jni suggested on the mailing list, we can also issue a user warning with the content of the above paragraph when calling the function with the default parameters (and maybe also add a non-default option to silence it), in addition to adding the paragraph to the docstring.
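A minimal sketch of the warning idea above, for illustration only. The loader name `load_toy_dataset`, the `warn` flag, and the notice wording are hypothetical and are not the scikit-learn API; the placeholder return value stands in for the real data.

```python
import warnings

# Hypothetical notice text, paraphrasing the draft docstring paragraph.
ETHICS_NOTICE = (
    "This dataset has an ethical problem. Its use is strongly discouraged "
    "unless the purpose of the code is to study and educate about ethical "
    "issues in data science and machine learning."
)


def load_toy_dataset(*, warn=True):
    """Hypothetical loader that warns by default.

    Passing ``warn=False`` is the non-default option that silences the notice.
    """
    if warn:
        # stacklevel=2 attributes the warning to the caller's line.
        warnings.warn(ETHICS_NOTICE, UserWarning, stacklevel=2)
    # Placeholder payload; a real loader would return the dataset here.
    return {"data": [], "target": []}
```

With this shape, `load_toy_dataset()` emits a `UserWarning`, while `load_toy_dataset(warn=False)` stays silent. Callers could also suppress it with the standard `warnings.filterwarnings` mechanism instead of a dedicated flag.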
I like the warning in the docs. I think we need to keep a reference to the function at least in the API ref: when a function isn't referenced anywhere, sphinx does not render it at all and it becomes impossible to find. But IMHO if we have a doc warning both in the docstring and in the "toy datasets" section, we don't need to remove any ref. Maybe we can just move them to the bottom. Regarding a user warning in the code: no strong opinion. I just think we shouldn't raise a warning unless we plan to remove the loader eventually, or unless we allow disabling the warning as suggested.
@ogrisel's suggestion sounds perfect to me. I also use the data set in education and specifically discuss the ethical issues within lectures and exercises. Ethical problems with data and AI do not go away when we hide them; they can only go away when we teach how to spot and avoid them. Adding a comment with references to the docstring sounds like a perfect solution for this.
Consolidating @adrinjalali's reply on the mailing list here for people not registered on the mailing list:
But if they do that, we should also point to alternative related datasets that do not have the "segregationist variable problem", for instance
Sounds good to me. There are still legit use cases for a problematic dataset, e.g. discussing the problematic aspects of the dataset :D but overall, I'm happy with the combination of these solutions.
towards: #16155
Removing all the references to the boston dataset from the docs
Reference Issues/PRs
What does this implement/fix? Explain your changes.
Any other comments?
Closes #17351
worked on by @ezebunandu and @amy12xx