FIX make pandas and liac arff parser quoting behaviour closer #23497

glemaitre · 2022-05-31T09:17:34Z

Strip the quotes around strings so that the pandas and LIAC-ARFF parsers behave consistently.

glemaitre · 2022-05-31T09:18:29Z

Since we did not release, there is no need for a change log entry.

glemaitre · 2022-05-31T09:18:48Z

@lesteve I think this is along the lines of what you had in mind.

glemaitre · 2022-05-31T10:13:25Z

~~I would argue that the last error reported for Titanic is a bug in LIAC-ARFF. It does not only strip the left and right quotes but the inner quotes. For instance:~~

"Lovell, Mr. John Hall ('Henry')"

~~will be parsed as:~~

"Lovell, Mr. John Hall (Henry)"

~~while the current PR does not intend to remove the quotes around Henry because they are part of the text.~~

Edit: Actually this is not the case. I have to check more.
Edit 2: I was stripping all quotes without checking if it was starting and ending with quotes.
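
To make the difference described in Edit 2 concrete, here is a minimal sketch in plain Python. The two helper names are hypothetical and are not the functions used in `sklearn/datasets/_arff_parser.py`; the sketch only assumes that a value should lose its quotes when a single matching pair wraps the whole string.

```python
def strip_all_quotes(value: str) -> str:
    """Over-eager variant: removes every quote character, inner ones included."""
    return value.replace("'", "").replace('"', "")


def strip_surrounding_quotes(value: str) -> str:
    """Remove one pair of matching quotes only if it wraps the whole value."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ("'", '"'):
        return value[1:-1]
    return value


# The Titanic value from the comment above, including its outer double quotes.
name = "\"Lovell, Mr. John Hall ('Henry')\""
print(strip_all_quotes(name))          # the quotes around Henry are lost
print(strip_surrounding_quotes(name))  # the quotes around Henry are kept
```

Checking both ends before stripping keeps quotes that are part of the text, and leaves values with an unmatched quote untouched.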

sklearn/datasets/_arff_parser.py

Co-authored-by: Olivier Grisel <[email protected]>

thomasjpfan

Thank you for the PR!

sklearn/datasets/tests/test_openml.py

sklearn/datasets/_arff_parser.py

sklearn/datasets/tests/test_openml.py

sklearn/datasets/_arff_parser.py

ogrisel · 2022-05-31T17:00:27Z

I was rambling in a thread on an outdated diff, and it is now hidden because of more recent changes. The important comment was:

I am not sure if we should process the values in isolation or based on the column-wide @attribute metadata that should be explicit about the quotation style of the column.

But maybe here we just aim at reproducing the behavior of the old "liac-arff" parser without aiming for correctness, and clean data will be fetched by using an unambiguous file format such as parquet in the future.

Thinking a bit more about this, maybe we could do a study on the available parquet datasets on openml.org and use them as a source of ground truth to make future-proof design decisions on the behavior of the pandas parser: if we don't know whether we should automatically strip the quotes of nominal variables, maybe we should try to load a few datasets with the pandas / pyarrow integration and look at the values in the resulting pandas dataframe.

ogrisel · 2022-05-31T17:03:29Z

Parsing the ARFF attribute metadata should not be difficult if it can help resolve any ambiguity in the CSV payload section of the ARFF file (after the @data line). Feel free to open any dataset from openml.org in a text editor by downloading the ARFF file (click on the Download icon) to see what I mean.

Here is the spec:

https://waikato.github.io/weka-wiki/formats_and_processing/arff_stable/
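
As a rough illustration of how little is needed to read that metadata, here is a minimal sketch using only the standard library. The function name `read_arff_attributes`, the regular expression, and the toy header are assumptions made for this example; they are not taken from scikit-learn or liac-arff, and the sketch ignores comments and other corner cases of the spec.

```python
import re

# Matches header lines such as:
#   @attribute name {a,b}
#   @ATTRIBUTE 'my name' string
_ATTRIBUTE_RE = re.compile(
    r"""^@attribute\s+
        (?P<name>'[^']*'|"[^"]*"|\S+)\s+
        (?P<type>.+?)\s*$""",
    re.IGNORECASE | re.VERBOSE,
)


def read_arff_attributes(lines):
    """Return (name, type) pairs from the ARFF header, stopping at @data."""
    attributes = []
    for line in lines:
        stripped = line.strip()
        if stripped.lower().startswith("@data"):
            break
        match = _ATTRIBUTE_RE.match(stripped)
        if match:
            attributes.append((match["name"], match["type"]))
    return attributes


# Hypothetical toy header, written for this example only.
header = """\
@relation toy
@attribute 'cat_single_quote' {'A', 'B', 'C'}
@attribute cat_double_quote {"A", "B", "C"}
@attribute str_col string
@attribute num_col numeric
@data
'A',"B",'some text',1
"""

for name, arff_type in read_arff_attributes(header.splitlines()):
    print(name, "->", arff_type)
```

The type string of a nominal attribute shows which quote character the file uses for its categories, which is the column-wide information that could drive quote handling instead of inspecting each value in isolation.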

sklearn/datasets/_arff_parser.py

sklearn/datasets/tests/test_arff_parser.py

thomasjpfan

Thank you for the update! I confirm that `test_fetch_openml_strip_quotes` fails on `main`.

sklearn/datasets/_arff_parser.py

sklearn/datasets/tests/test_arff_parser.py

ogrisel

Thanks for the added tests, it helps build some trust in the correctness of our parsing choices.

Here are some suggestions to strengthen them even further. Assuming those extended test suggestions pass as intended, LGTM.

sklearn/datasets/_arff_parser.py

sklearn/datasets/tests/data/toy_quotes.arff

sklearn/datasets/tests/test_arff_parser.py

sklearn/datasets/tests/test_openml.py

sklearn/datasets/_openml.py

ogrisel

I clarified the inline-comment to explicitly state our policy w.r.t. single quote handling.

I also pushed a cosmetic change to use `textwrap.dedent` in the tests.
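
For readers unfamiliar with the pattern, here is a minimal sketch of an inline ARFF payload written with `textwrap.dedent`. The helper name and the payload content are hypothetical and are not copied from the test suite.

```python
import textwrap


def make_arff_payload():
    """Build a small ARFF document while keeping the source code indented."""
    # The backslash after the opening quotes avoids a leading empty line.
    return textwrap.dedent(
        """\
        @relation toy
        @attribute cat {'A', 'B'}
        @attribute value numeric
        @data
        'A',1
        'B',2
        """
    )


payload = make_arff_payload()
print(payload)
```

The payload can stay indented with the surrounding test body, while the resulting string has no leading whitespace on any line.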

I will trigger a full-doc build to check that this does not break any example but otherwise, LGTM!

sklearn/datasets/_arff_parser.py

glemaitre · 2022-06-09T09:05:20Z

The documentation build passed; see https://github.com/scikit-learn/scikit-learn/runs/6808705821?check_suite_focus=true

Co-authored-by: Olivier Grisel <[email protected]>

ogrisel · 2022-06-09T13:35:45Z

@thomasjpfan I think this PR is in good shape for final review.

thomasjpfan

Minor nit, otherwise LGTM

sklearn/datasets/tests/test_arff_parser.py

Co-authored-by: Thomas J. Fan <[email protected]>

lesteve · 2022-06-09T14:31:26Z

Merging this one, thanks!

…-learn#23497) Co-authored-by: Olivier Grisel <[email protected]> Co-authored-by: Thomas J. Fan <[email protected]> Co-authored-by: Loïc Estève <[email protected]>

FIX strig quotes to be consistent between pandas and liac arff parser

2aa2f58

github-actions bot added the module:datasets label May 31, 2022

glemaitre added the No Changelog Needed label May 31, 2022

glemaitre changed the title ~~FIX strig quotes to be consistent between pandas and liac arff parser~~ FIX strip quotes to be consistent between pandas and liac arff parser May 31, 2022

glemaitre added 2 commits May 31, 2022 12:06

avoid casting

cbc8f25

remove debug

cc8c3b2

glemaitre added 2 commits May 31, 2022 12:59

only strip starting and ending quotes

7b32fcb

iter

41716ed

ogrisel reviewed May 31, 2022

sklearn/datasets/_arff_parser.py (outdated)

Update sklearn/datasets/_arff_parser.py

33ed5c4

Co-authored-by: Olivier Grisel <[email protected]>

thomasjpfan reviewed May 31, 2022

sklearn/datasets/tests/test_openml.py (outdated)

sklearn/datasets/_arff_parser.py (outdated)

sklearn/datasets/_arff_parser.py (outdated)

sklearn/datasets/_arff_parser.py (outdated)

iter

ec4df9f

ogrisel reviewed May 31, 2022

sklearn/datasets/tests/test_openml.py (outdated)

ogrisel reviewed May 31, 2022

sklearn/datasets/_arff_parser.py (outdated)

thomasjpfan reviewed May 31, 2022

sklearn/datasets/_arff_parser.py (outdated)

iter

2acc0e8

ogrisel reviewed May 31, 2022

sklearn/datasets/_arff_parser.py (outdated)

iter

6fe8b6d

glemaitre added 2 commits June 1, 2022 11:02

iter

3896c60

use monkey patching

edc1940

glemaitre commented Jun 1, 2022

sklearn/datasets/_arff_parser.py

glemaitre added 2 commits June 1, 2022 11:08

less diff

a911698

Merge remote-tracking branch 'origin/main' into is/23381

9d53b76

lesteve reviewed Jun 1, 2022

sklearn/datasets/_arff_parser.py (outdated)

iter

8429546

glemaitre added 4 commits June 1, 2022 11:29

lesteve comment

ea1913c

iter

19bb8d1

remove debug

f30509e

TST add a simple ARFF file with some quotes

8f64e7e

glemaitre commented Jun 1, 2022

sklearn/datasets/tests/test_arff_parser.py (outdated)

avoid replacing when not needed

f5e8a1f

thomasjpfan reviewed Jun 7, 2022

sklearn/datasets/_arff_parser.py (outdated)

sklearn/datasets/tests/test_arff_parser.py (outdated)

sklearn/datasets/tests/test_arff_parser.py (outdated)

glemaitre added 2 commits June 8, 2022 10:24

Merge remote-tracking branch 'origin/main' into is/23381

da24e40

iter

3896e7e

ogrisel reviewed Jun 8, 2022

glemaitre added 3 commits June 8, 2022 16:39

iter

7f92eb3

iter

cc205e3

doc

cbb501e

thomasjpfan reviewed Jun 8, 2022

sklearn/datasets/_openml.py

Use textwrap.dedent for inline ARFF payloads in tests

ad39cfd

ogrisel approved these changes Jun 9, 2022

sklearn/datasets/_arff_parser.py (outdated)

Trigger [doc build]

02450d1

glemaitre and others added 4 commits June 9, 2022 11:05

Update sklearn/datasets/_arff_parser.py

5e0c91a

Co-authored-by: Olivier Grisel <[email protected]>

add info doc

fccd710

Merge remote-tracking branch 'origin/main' into is/23381

6fd6255

lint

12411a0

thomasjpfan approved these changes Jun 9, 2022

sklearn/datasets/tests/test_arff_parser.py (outdated)

Update sklearn/datasets/tests/test_arff_parser.py

a0a4a1d

Co-authored-by: Thomas J. Fan <[email protected]>

lesteve changed the title ~~FIX strip quotes to be consistent between pandas and liac arff parser~~ FIX make pandas and liac arff parser quoting behaviour closer Jun 9, 2022

lesteve merged commit 8515b48 into scikit-learn:main Jun 9, 2022
