Closed
Description
Describe the bug
Scikit-learn 1.2 introduced a faster parser for OpenML files, but that parser parses the adult dataset incorrectly, so we need to continue using the slow parser. (See #1166.)
Steps/Code to Reproduce
>>> import sklearn.datasets as skd
>>> d = skd.fetch_openml(data_id=1590, as_frame=True, parser='pandas')
>>> d.target
0 <=50K
1 <=50K
2 >50K
3 >50K
4 <=50K
...
48837 <=50K
48838 >50K
48839 <=50K
48840 <=50K
48841 >50K
Name: class, Length: 48842, dtype: category
Categories (2, object): [' <=50K', ' >50K']
In the last line you can see an extra leading space in each category label. This is because the source CSV file has spaces after commas.
The proposed fix is to upload a version of the adult dataset without those spaces; we can then replace the slow parser with the faster 'pandas' parser.
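To illustrate the cause without downloading anything, here is a minimal sketch using a small in-memory CSV that stands in for the adult data (the three rows and the column names are made up for illustration). It shows how a space after each comma ends up in the category labels, how `skipinitialspace=True` in `pandas.read_csv` avoids that when you control the read, and how labels on an already-loaded categorical column could be stripped as a stopgap. This is only a possible workaround, not the fix proposed above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the adult CSV: note the space after each comma.
csv_text = "39, <=50K\n50, >50K\n38, <=50K\n"
names = ["age", "class"]

# Default parsing keeps the leading space as part of each label.
raw = pd.read_csv(
    io.StringIO(csv_text), header=None, names=names, dtype={"class": "category"}
)
print(list(raw["class"].cat.categories))

# skipinitialspace=True drops whitespace that follows the delimiter.
clean = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=names,
    dtype={"class": "category"},
    skipinitialspace=True,
)
print(list(clean["class"].cat.categories))

# Stopgap for a column that is already loaded: strip each category label.
stripped = raw["class"].cat.rename_categories(lambda label: label.strip())
print(list(stripped.cat.categories))
```

The same `rename_categories` call could in principle be applied to `d.target` from the reproduction above, but that only hides the symptom in one column; feature columns with string categories would need the same treatment.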