BUG OpenML version of adult does not work with pandas parser #1168

@MiroDudik

Describe the bug

Scikit-learn 1.2 introduced a faster parser for OpenML files (parser='pandas'), but that parser produces an incorrectly parsed adult dataset, so we need to keep using the slow parser for now. (See #1166.)

Steps/Code to Reproduce

>>> import sklearn.datasets as skd
>>> d = skd.fetch_openml(data_id=1590, as_frame=True, parser='pandas')
>>> d.target
0         <=50K
1         <=50K
2          >50K
3          >50K
4         <=50K
          ...  
48837     <=50K
48838      >50K
48839     <=50K
48840     <=50K
48841      >50K
Name: class, Length: 48842, dtype: category
Categories (2, object): [' <=50K', ' >50K']

The last line shows that each category label is parsed with a leading space (' <=50K' instead of '<=50K'). This is because the source CSV file has spaces after commas.

The proposed fix is to upload a version of adult without those spaces; we can then replace the slow parser with the faster 'pandas' parser.
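Until a cleaned version is available, one possible stopgap is to strip the whitespace from the category labels after loading. The sketch below is not part of the proposed fix; it uses a small synthetic series as a stand-in for `d.target` (so it runs without a network call), and the helper name `strip_category_labels` is hypothetical.

```python
import pandas as pd

# Synthetic stand-in for d.target as returned by parser='pandas':
# the category labels carry a leading space.
target = pd.Series([" <=50K", " <=50K", " >50K"], name="class", dtype="category")


def strip_category_labels(s: pd.Series) -> pd.Series:
    """Return a copy of a categorical series with whitespace stripped from its labels.

    Hypothetical helper for illustration; assumes the stripped labels stay unique.
    """
    return s.cat.rename_categories(lambda label: label.strip())


cleaned = strip_category_labels(target)
print(list(cleaned.cat.categories))
```

The same helper would have to be applied to every categorical feature column as well, which is why fixing the uploaded data is the cleaner option.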
