Inconsistency between liac-arff and pandas parser in fetch_openml #25311

@glemaitre

As reported in fairlearn/fairlearn#1166, the liac-arff and pandas parsers give inconsistent results.

According to the ARFF spec, leading whitespace is ignored unless it is inside quotes. pandas read_csv keeps this whitespace by default, so the category labels end up with a leading space. E.g.

>>> import sklearn.datasets as skd
>>> d = skd.fetch_openml(data_id=1590, as_frame=True, parser='pandas')
>>> d.target
0         <=50K
1         <=50K
2          >50K
3          >50K
4         <=50K
          ...  
48837     <=50K
48838      >50K
48839     <=50K
48840     <=50K
48841      >50K
Name: class, Length: 48842, dtype: category
Categories (2, object): [' <=50K', ' >50K']
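
For anyone hitting this in the meantime, a possible after-the-fact workaround is to strip the category labels once the data is loaded. This is only a sketch: the small Series below is a stand-in for `d.target` so that it runs without downloading the dataset, and stripping this way cannot tell apart whitespace that was quoted in the ARFF file from whitespace that was not.

```python
import pandas as pd

# Stand-in for d.target (assumption: same labels as in the output above).
target = pd.Series([" <=50K", " <=50K", " >50K"], name="class", dtype="category")

# Rename the categories only; the underlying codes are left untouched.
cleaned = target.cat.rename_categories(lambda label: label.strip())

print(list(target.cat.categories))
print(list(cleaned.cat.categories))
```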

I am not sure we can solve this easily: once the data has been read by read_csv, the information about which values were quoted is gone. I think the best we can do is to let users pass additional keyword arguments through to read_csv, so that the parsing is flexible enough.
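
To illustrate the keyword-argument idea, here is a rough sketch using plain pandas. `read_data_section` is a hypothetical helper, not an existing scikit-learn function, and the two rows are made-up data in the style of the dataset above. `skipinitialspace` is an existing `read_csv` option that skips spaces after the delimiter.

```python
import io

import pandas as pd


def read_data_section(text, columns, **read_csv_kwargs):
    """Hypothetical helper: parse CSV-like rows, forwarding extra options to read_csv."""
    return pd.read_csv(
        io.StringIO(text), header=None, names=columns, **read_csv_kwargs
    )


rows = "39, State-gov, <=50K\n50, Private, >50K\n"
columns = ["age", "workclass", "class"]

# Default behaviour: the space after each comma stays in the string values.
default = read_data_section(rows, columns)

# Forwarding skipinitialspace=True drops the space after the delimiter.
stripped = read_data_section(rows, columns, skipinitialspace=True)

print(default["class"].tolist())
print(stripped["class"].tolist())
```

With this shape the caller decides how strict the parsing should be, and the loader itself does not have to guess.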
