FIX make pandas and liac arff parser quoting behaviour closer #23497
Since we did not release, there is no need for a change log entry.
@lesteve I think that this is something like what you had in mind.
Edit: Actually this is not the case. I have to check more.
Co-authored-by: Olivier Grisel <[email protected]>
Thank you for the PR!
I was rambling in a thread on an outdated diff, and it is now hidden because of more recent changes. The important comment was:
Thinking a bit more about this, maybe we could do a study on the available parquet datasets on openml.org and use them as a source of ground truth to make future-proof design decisions on the behavior of the pandas parser: if we don't know whether we should automatically strip the quotes of nominal variables, maybe we should try to load a few datasets with the pandas / pyarrow integration and look at the values in the resulting pandas dataframe.
Parsing the ARFF attribute metadata should not be difficult if it can help resolve any ambiguity in the CSV payload section of the ARFF file (after the `@data` line). Here is the spec: https://waikato.github.io/weka-wiki/formats_and_processing/arff_stable/
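To make the suggestion above concrete, here is a minimal sketch of reading nominal attribute declarations from an ARFF header using only the standard library. It is not the scikit-learn or LIAC-ARFF implementation: the helper name `parse_nominal_attributes` and the toy `ARFF_TEXT` are hypothetical, and the naive comma split assumes category names never contain commas or spaces in the attribute name.

```python
import re

# Hypothetical toy ARFF content, used only for illustration.
ARFF_TEXT = """\
@relation toy
@attribute color {'red', "blue", green}
@attribute size numeric
@data
'red',1
blue,2
"""


def parse_nominal_attributes(arff_text):
    """Return {attribute_name: [categories]} for nominal attributes.

    Simplifying assumption: categories do not contain commas, so a plain
    split on "," is enough. A real parser must handle quoted commas.
    """
    nominal = {}
    for line in arff_text.splitlines():
        stripped = line.strip()
        if stripped.lower() == "@data":
            break  # the header ends where the CSV-like payload starts
        match = re.match(
            r"@attribute\s+(\S+)\s+\{(.*)\}\s*$", stripped, flags=re.IGNORECASE
        )
        if match:
            name, body = match.groups()
            # Drop surrounding whitespace, then surrounding quote characters.
            nominal[name] = [c.strip().strip("'\"") for c in body.split(",")]
    return nominal


categories = parse_nominal_attributes(ARFF_TEXT)
```

With the declared categories in hand, a payload value such as `'red'` can be compared against the header both with and without its quotes to decide which reading is intended.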
Thank you for the update! I confirm that `test_fetch_openml_strip_quotes` fails on `main`.
Thanks for the added tests, it helps build some trust in the correctness of our parsing choices.
Here are some suggestions to strengthen them even further. Assuming those extended test suggestions pass as intended, LGTM.
I clarified the inline-comment to explicitly state our policy w.r.t. single quote handling.
I also pushed a cosmetic change to use `textwrap.dedent` in the tests.
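For readers unfamiliar with the idiom, this is roughly what `textwrap.dedent` buys in a test: the ARFF fixture can be indented with the surrounding code while the resulting string has no leading whitespace. The fixture content below is a hypothetical example, not the one used in the PR's tests.

```python
import textwrap

# Hypothetical inline ARFF fixture, indented to match the test body.
arff_payload = textwrap.dedent(
    """\
    @relation toy
    @attribute color {'red', blue}
    @data
    'red'
    blue
    """
)

# After dedent, every line starts at column 0.
lines = arff_payload.splitlines()
```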
I will trigger a full-doc build to check that this does not break any example but otherwise, LGTM!
The documentation build passed; see https://github.com/scikit-learn/scikit-learn/runs/6808705821?check_suite_focus=true
Co-authored-by: Olivier Grisel <[email protected]>
@thomasjpfan I think this PR is in good shape for final review.
Minor nit, otherwise LGTM
Co-authored-by: Thomas J. Fan <[email protected]>
Merging this one, thanks!
…-learn#23497) Co-authored-by: Olivier Grisel <[email protected]> Co-authored-by: Thomas J. Fan <[email protected]> Co-authored-by: Loïc Estève <[email protected]>
closes #23381
Strip the quotes around string values so that the pandas and LIAC ARFF parsers behave consistently.
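As an illustration of what "stripping the quotes" means for a single value, here is a minimal sketch. The helper `strip_matching_quotes` is hypothetical and is not the code merged in this PR; the exact policy for single versus double quotes is the one stated in the PR's inline comment and is not reproduced here.

```python
def strip_matching_quotes(value):
    """Remove one pair of matching surrounding quotes, if present.

    Illustrative assumption: single and double quotes are treated alike,
    and unbalanced quotes are left untouched.
    """
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ("'", '"'):
        return value[1:-1]
    return value


raw_values = ["'red'", '"blue"', "green", "'half"]
cleaned = [strip_matching_quotes(v) for v in raw_values]
```

Applying the same normalization to the output of both parsers is one way to check that they agree on a given dataset.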