FIX make pandas and liac arff parser quoting behaviour closer #23497

Merged
merged 31 commits into from
Jun 9, 2022

Conversation

glemaitre
Member

closes #23381

Strip the surrounding quotes from string values so that the pandas and LIAC-ARFF parsers give consistent results.

@glemaitre
Member Author

Since we did not release, there is no need for a change log entry.

@glemaitre
Member Author

@lesteve I think this is something like what you had in mind.

@glemaitre glemaitre changed the title FIX strig quotes to be consistent between pandas and liac arff parser FIX strip quotes to be consistent between pandas and liac arff parser May 31, 2022
@glemaitre
Member Author

glemaitre commented May 31, 2022

I would argue that the last error reported for Titanic is a bug in LIAC-ARFF. It strips not only the leading and trailing quotes but also the inner quotes. For instance:

"Lovell, Mr. John Hall ('Henry')"

will be parsed as:

"Lovell, Mr. John Hall (Henry)"

whereas the current PR does not remove the quotes around Henry, because they are part of the text.

Edit: Actually, this is not the case. I have to check more.
Edit 2: I was stripping all quotes without checking whether the string started and ended with quotes.
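
For illustration, the rule described in "Edit 2" (strip quotes only when the value both starts and ends with the same quote character) could be sketched as below. This is a minimal sketch, not the code from this PR; the helper name `strip_surrounding_quotes` is hypothetical.

```python
def strip_surrounding_quotes(value: str) -> str:
    """Remove one pair of matching quotes only when they wrap the whole value.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    # Inner quotes are part of the text, so only the first and last
    # characters are inspected.
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ("'", '"'):
        return value[1:-1]
    return value


raw = "\"Lovell, Mr. John Hall ('Henry')\""
cleaned = strip_surrounding_quotes(raw)
print(cleaned)
```

With this rule the quotes around Henry survive, and a value such as `it's`, which does not start and end with a quote, is left untouched.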

Member

@thomasjpfan thomasjpfan left a comment

Thank you for the PR!

@ogrisel
Member

ogrisel commented May 31, 2022

I was rambling in a thread on an outdated diff, and it is now hidden because of more recent changes. The important comment was:

I am not sure if we should process the values in isolation or based on the column-wide @attribute metadata that should be explicit about the quotation style of the column.

But maybe here we just aim to reproduce the behavior of the old "liac-arff" parser without aiming for correctness, and clean data will be fetched in the future using an unambiguous file format such as parquet.

Thinking a bit more about this, maybe we could do a study of the available parquet datasets on openml.org and use them as a source of ground truth to make future-proof design decisions on the behavior of the pandas parser. If we don't know whether we should automatically strip the quotes of nominal variables, maybe we should try to load a few datasets with the pandas / pyarrow integration and look at the values in the resulting pandas dataframe.

@ogrisel
Member

ogrisel commented May 31, 2022

Parsing the ARFF attribute metadata should not be difficult, if it can help resolve any ambiguity in the CSV payload section of the ARFF file (after the @data line). To see what I mean, open any dataset from openml.org in a text editor: download the ARFF file by clicking on the Download icon.

Here is the spec:

https://waikato.github.io/weka-wiki/formats_and_processing/arff_stable/
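
As a rough illustration of reading the `@attribute` metadata before the `@data` line, here is a simplified sketch. It is an assumption-laden example, not code from scikit-learn or liac-arff: the helper name `read_attributes` and the sample file are made up, and the regular expression only covers simple declarations (it skips, for example, `date` attributes that carry a format string).

```python
import re

# Hypothetical, simplified pattern: an attribute name (optionally quoted)
# followed by either a {nominal, list} or a single-word type.
_ATTRIBUTE = re.compile(
    r"""^@attribute\s+
        (?P<name>'[^']*'|"[^"]*"|\S+)\s+
        (?P<type>\{.*\}|\S+)\s*$""",
    re.IGNORECASE | re.VERBOSE,
)


def read_attributes(arff_text):
    """Return (name, type) pairs declared in the header of an ARFF document."""
    attributes = []
    for line in arff_text.splitlines():
        line = line.strip()
        # The CSV payload starts after @data; stop reading the header there.
        if line.lower().startswith("@data"):
            break
        match = _ATTRIBUTE.match(line)
        if match:
            attributes.append((match["name"], match["type"]))
    return attributes


sample = "\n".join(
    [
        "@relation titanic",
        "@attribute 'name' string",
        "@attribute sex {'male','female'}",
        "@attribute age numeric",
        "@data",
        "'Lovell, Mr. John Hall',male,20",
    ]
)
attributes = read_attributes(sample)
print(attributes)
```

The declared type (string, nominal set, numeric) is what could then inform how quotes in each column of the payload are interpreted.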

Member

@thomasjpfan thomasjpfan left a comment

Thank you for the update! I confirm that `test_fetch_openml_strip_quotes` fails on `main`.

Member

@ogrisel ogrisel left a comment

Thanks for the added tests; they help build some trust in the correctness of our parsing choices.

Here are some suggestions to strengthen them even further. Assuming those extended test suggestions pass as intended, LGTM.

Member

@ogrisel ogrisel left a comment

I clarified the inline-comment to explicitly state our policy w.r.t. single quote handling.

I also pushed a cosmetic change to use textwrap.dedent in the tests.

I will trigger a full-doc build to check that this does not break any example but otherwise, LGTM!
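
For readers unfamiliar with the `textwrap.dedent` pattern mentioned above, here is a small sketch of how an inline ARFF fixture can be written indented in a test and still yield unindented text. The toy ARFF content is made up for illustration and is not taken from the PR's tests.

```python
import textwrap

# The backslash after the opening quotes avoids a leading empty line;
# dedent() then removes the indentation common to all lines.
arff = textwrap.dedent(
    """\
    @relation toy
    @attribute text string
    @data
    'some text'
    """
)
lines = arff.splitlines()
print(lines)
```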

@glemaitre
Member Author

glemaitre commented Jun 9, 2022

@ogrisel
Member

ogrisel commented Jun 9, 2022

@thomasjpfan I think this PR is in good shape for final review.

Member

@thomasjpfan thomasjpfan left a comment

Minor nit, otherwise LGTM

@lesteve
Member

lesteve commented Jun 9, 2022

Merging this one, thanks!

@lesteve lesteve changed the title FIX strip quotes to be consistent between pandas and liac arff parser FIX make pandas and liac arff parser quoting behaviour closer Jun 9, 2022
@lesteve lesteve merged commit 8515b48 into scikit-learn:main Jun 9, 2022
ogrisel added a commit to ogrisel/scikit-learn that referenced this pull request Jul 11, 2022
…-learn#23497)

Co-authored-by: Olivier Grisel <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>
Co-authored-by: Loïc Estève <[email protected]>
Development

Successfully merging this pull request may close these issues.

fetch_openml difference between pandas and liac-arff parser
4 participants