Closed
Description
As reported in fairlearn/fairlearn#1166, the liac-arff and pandas parsers behave inconsistently.
According to the ARFF specs, leading whitespace is ignored unless it is enclosed in quotes. The pandas read_csv keeps this whitespace by default. E.g.
>>> import sklearn.datasets as skd
>>> d = skd.fetch_openml(data_id=1590, as_frame=True, parser='pandas')
>>> d.target
0 <=50K
1 <=50K
2 >50K
3 >50K
4 <=50K
...
48837 <=50K
48838 >50K
48839 <=50K
48840 <=50K
48841 >50K
Name: class, Length: 48842, dtype: category
Categories (2, object): [' <=50K', ' >50K']
I am not sure that we can solve this easily: once the data has been read by read_csv, the information about the quotes is lost. I assume the best we can offer is to forward any additional keyword arguments to read_csv so that it is flexible enough.
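To illustrate what forwarding keyword arguments could buy us, here is a minimal sketch that uses pandas alone. The `raw` payload and the column names are made up for illustration and only mimic the data section of the ARFF file. `skipinitialspace` is an existing read_csv option. The `rename_categories` step is a possible after-the-fact workaround, not something the parser does today.

```python
import io

import pandas as pd

# Hypothetical CSV payload mimicking an ARFF data section, where every
# value after a comma carries a leading space.
raw = "39, State-gov, <=50K\n50, Self-emp, >50K\n38, Private, <=50K\n"
names = ["age", "workclass", "class"]

# Default behaviour: the leading space is kept as part of the value.
default = pd.read_csv(io.StringIO(raw), header=None, names=names)

# skipinitialspace=True drops whitespace that follows the delimiter.
stripped = pd.read_csv(
    io.StringIO(raw), header=None, names=names, skipinitialspace=True
)

default_labels = sorted(default["class"].unique())
stripped_labels = sorted(stripped["class"].unique())

# Workaround on an already-parsed categorical column: strip each category.
target = default["class"].astype("category")
cleaned = target.cat.rename_categories(lambda c: c.strip())
cleaned_labels = list(cleaned.cat.categories)
```

Note that `skipinitialspace` cannot tell quoted from unquoted whitespace on its own, so this only matches the ARFF behaviour when no quoted value starts with a space that is meant to be kept.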