@glemaitre
Member

closes #25311

Workaround for inconsistencies between liac-arff and pandas parser
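To illustrate the kind of inconsistency involved, here is a minimal sketch. It is not the code from this PR. It assumes a data section whose values are separated by a comma followed by a space, and the sample rows are made up. With default settings, pandas' `read_csv` keeps that leading space as part of each string value. Passing `skipinitialspace=True` drops it.

```python
import io

import pandas as pd

# Hypothetical data rows: values are separated by ", " (comma plus space).
raw = "39, State-gov, 77516\n50, Self-emp-not-inc, 83311\n"

# Default parsing keeps the space after the delimiter in string columns.
default = pd.read_csv(io.StringIO(raw), header=None)

# skipinitialspace=True tells the parser to skip whitespace after the delimiter.
stripped = pd.read_csv(io.StringIO(raw), header=None, skipinitialspace=True)

print(repr(default.iloc[0, 1]))
print(repr(stripped.iloc[0, 1]))
```

Categories read with the default settings would therefore differ from the stripped ones by a leading space.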

@glemaitre glemaitre changed the title from "ENH provide a way to expose read_csv options" to "FIX skip space after delimiter in fetch_openml when using pandas parser" Jan 6, 2023
@glemaitre glemaitre added this to the 1.2.1 milestone Jan 6, 2023
@glemaitre
Member Author

ping @adrinjalali since you already know the context

Member

@thomasjpfan thomasjpfan left a comment

The new data-v1-* file is much larger (652K) than the other data files:

cd sklearn/datasets/tests/data/openml
find . -name '*data*' | xargs du -sh
4.0K	./id_561/data-v1-dl-52739.arff.gz
4.0K	./id_62/data-v1-dl-52352.arff.gz
8.0K	./id_40589/data-v1-dl-4644182.arff.gz
4.0K	./id_40675/data-v1-dl-4965250.arff.gz
4.0K	./id_1/data-v1-dl-1.arff.gz
8.0K	./id_42585/data-v1-dl-21854866.arff.gz
4.0K	./id_292/data-v1-dl-49822.arff.gz
652K	./id_1590/data-v1-dl-1595261.arff.gz
4.0K	./id_61/data-v1-dl-61.arff.gz
 32K	./id_40945/data-v1-dl-16826755.arff.gz
4.0K	./id_1119/data-v1-dl-54002.arff.gz
4.0K	./id_2/data-v1-dl-1666876.arff.gz
 20K	./id_3/data-v1-dl-3.arff.gz
8.0K	./id_40966/data-v1-dl-17928620.arff.gz

For testing, can we use a subset of the data instead?

@glemaitre
Member Author

Yes, for the regression test, I can select the first 10 samples. That would be more than enough.
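One way to produce such a subset is sketched below, under assumptions. The helper name `truncate_arff_gz` and the file paths are hypothetical, and this is not necessarily how the test file in this PR was generated. It copies the ARFF header unchanged and keeps only the first `n_samples` rows that follow the `@data` marker.

```python
import gzip


def truncate_arff_gz(src, dst, n_samples=10):
    """Copy the ARFF header and the first n_samples data rows from src to dst.

    Hypothetical helper: both paths point to gzip-compressed ARFF files.
    Returns the number of data rows written.
    """
    kept = 0
    in_data = False
    with gzip.open(src, "rt", encoding="utf-8") as fin, gzip.open(
        dst, "wt", encoding="utf-8"
    ) as fout:
        for line in fin:
            if not in_data:
                # Header lines (@relation, @attribute, comments) are copied as-is.
                fout.write(line)
                if line.strip().lower() == "@data":
                    in_data = True
                continue
            if kept >= n_samples:
                break
            content = line.strip()
            # Skip blank lines and ARFF comments inside the data section.
            if not content or content.startswith("%"):
                continue
            fout.write(line)
            kept += 1
    return kept
```

For example, `truncate_arff_gz("full.arff.gz", "subset.arff.gz", n_samples=10)` would write a file with the same header and at most 10 rows.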

Member

@thomasjpfan thomasjpfan left a comment

There is a merge conflict; otherwise, LGTM.

@thomasjpfan thomasjpfan added the "Waiting for Second Reviewer" label (First reviewer is done, need a second one!) Jan 11, 2023
Member

@adrinjalali adrinjalali left a comment

Thanks @glemaitre

@adrinjalali adrinjalali enabled auto-merge (squash) January 13, 2023 12:33
@adrinjalali adrinjalali merged commit f45a907 into scikit-learn:main Jan 13, 2023
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 23, 2023
adrinjalali added a commit that referenced this pull request Jan 24, 2023

Labels

module:datasets, Waiting for Second Reviewer

Development

Successfully merging this pull request may close these issues.

Inconsistency between liac-arff and pandas parser in fetch_openml

3 participants