DEPR announce change of default na_values in fetch_openml #26436

glemaitre · 2023-05-25T18:42:59Z

Follow-up of #26433

Make a smooth change of behaviour regarding the default NA values to be considered when using fetch_openml with the Pandas parser.

In Pandas 2.X, "None" is now considered an NA value, so the behaviour would change depending on the pandas version. To alleviate the issue, we handle the default NA values ourselves, using the same set of values as Pandas 1.X. We go through a smooth deprecation cycle to adopt the Pandas 2.X list in scikit-learn 1.5 (in 2 releases).
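To illustrate the idea of pinning the markers, here is a minimal sketch using plain `pandas.read_csv` rather than the actual `fetch_openml` code path. The two marker lists and the helper `count_missing` are hypothetical names, and the lists are only an illustrative subset of the pandas defaults.

```python
import io

import pandas as pd

# Illustrative subset of the markers pandas 1.X treats as missing by default.
# This is NOT the complete list; both names below are hypothetical.
OLD_STYLE_NA_VALUES = ["", "NA", "N/A", "NaN", "nan", "NULL", "null"]
NEW_STYLE_NA_VALUES = OLD_STYLE_NA_VALUES + ["None"]

csv_text = "id,label\n1,None\n2,yes\n"


def count_missing(na_values):
    # keep_default_na=False makes pandas use only the markers passed here,
    # so the result no longer depends on the installed pandas version.
    df = pd.read_csv(
        io.StringIO(csv_text), na_values=na_values, keep_default_na=False
    )
    return int(df["label"].isna().sum())


old_count = count_missing(OLD_STYLE_NA_VALUES)  # "None" kept as a string
new_count = count_missing(NEW_STYLE_NA_VALUES)  # "None" parsed as missing
```

Passing the marker set explicitly is what removes the dependency on the pandas version: the same call gives the same result on Pandas 1.X and 2.X.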

The warning can be silenced using read_csv_kwargs introduced in #26433 and the sklearn.datasets.FUTURE_NA_VALUES set.

The diff of this PR can be reduced by merging #26433 first, provided the current use case is convincing enough to introduce the read_csv_kwargs parameter.

Co-authored-by: Thomas J. Fan <[email protected]>

ogrisel · 2023-05-30T15:00:06Z

As discussed IRL, I am a bit worried about issuing a FutureWarning whenever we call fetch_openml with the default arguments, even when the data contains no missing values with either old or new pandas.

We could parse the CSV twice, with the old and the new missing value markers, to detect whether it is worth raising the FutureWarning. It's 2x the cost, but maybe it's not that bad. Of course, we would not do it when the user is explicit about the missing value markers. Furthermore, it's a transient overhead: after 2 scikit-learn releases we would be back to parsing the data only once (with the new default missing value markers).
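The double-parsing idea could look roughly like the sketch below. It is not code from this PR: `parse` and `warn_if_markers_disagree` are hypothetical helpers, and the warning message is a placeholder.

```python
import io
import warnings

import pandas as pd


def parse(csv_text, na_values):
    # Use only the explicit markers so both parses are fully controlled.
    return pd.read_csv(
        io.StringIO(csv_text), na_values=na_values, keep_default_na=False
    )


def warn_if_markers_disagree(csv_text, old_markers, new_markers):
    # Parse once per marker set and compare where missing values appear.
    old_mask = parse(csv_text, old_markers).isna()
    new_mask = parse(csv_text, new_markers).isna()
    differs = not old_mask.equals(new_mask)
    if differs:
        warnings.warn(
            "The default missing value markers will change; some values in "
            "this dataset will be parsed differently.",
            FutureWarning,
        )
    return differs
```

The second parse is only needed when the user relies on the default markers, which is why the overhead would disappear once the deprecation cycle ends.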

adrinjalali · 2023-06-01T11:30:53Z

Parsing twice is going to be very slow. This particular case only happens with "None" as a string. We could instead check whether that string occurs in the data and only warn in those cases?

Also, for fetch_openml, I hope there aren't many "None" values used to indicate missing value, so we could also assume it doesn't exist and set the parameter so that we use the old pandas behavior here.
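A cheaper check along these lines could scan the raw cells for the literal marker instead of parsing twice. This is only a sketch using the standard library `csv` module; `contains_marker` is a hypothetical helper, not something in scikit-learn.

```python
import csv
import io


def contains_marker(csv_text, marker="None"):
    # Stream over the rows and stop at the first cell that is exactly
    # equal to the marker; substrings such as "Nonesuch" do not match.
    reader = csv.reader(io.StringIO(csv_text))
    return any(marker in row for row in reader)
```

Because `any` short-circuits, the scan stops as soon as one matching cell is found, so the worst case is a single pass over the data.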

glemaitre · 2023-06-01T12:32:00Z

> Also, for fetch_openml, I hope there aren't many "None" values used to indicate missing value, so we could also assume it doesn't exist and set the parameter so that we use the old pandas behavior here.

Actually, this is an interesting point. The ARFF format forces us to use "?" (cf. https://www.cs.waikato.ac.nz/~ml/weka/arff.html) as the missing marker. I forgot to set keep_default_na=False when creating the parser, meaning that we treat additional markers as missing values.

Theoretically, this could be considered a bug since we don't follow the ARFF specification.
We could simply set keep_default_na=False and make it a bug fix. It might introduce some regressions as well.
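As a rough illustration of what following the ARFF specification would mean at the pandas level, the sketch below treats only "?" as missing. It uses `pandas.read_csv` directly on a toy CSV string and is not the parser code from scikit-learn.

```python
import io

import pandas as pd

csv_text = "id,value\n1,?\n2,NA\n3,None\n"

# Only "?" is declared as a missing marker; keep_default_na=False disables
# the pandas defaults, so "NA" and "None" stay ordinary strings.
df = pd.read_csv(
    io.StringIO(csv_text), na_values=["?"], keep_default_na=False
)
missing_mask = df["value"].isna().tolist()
```

With this setting the result is the same on Pandas 1.X and 2.X, since neither version's default marker list is consulted.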

glemaitre · 2023-06-01T12:38:38Z

We could still merge #26433 to allow more flexibility when an uploaded ARFF file does not strictly follow the ARFF specs.

adrinjalali · 2023-06-01T14:20:08Z

Happy to merge the other one, so we close this?

glemaitre · 2023-06-09T08:18:22Z

When doing the new PR, I will indicate that it will close this one.

glemaitre and others added 4 commits May 25, 2023 16:36

ENH allows to overwrite read_csv parameter in fetch_openml

1c26cb2

update pr number

e5802ff

Apply suggestions from code review

e600588

Co-authored-by: Thomas J. Fan <[email protected]>

DEPR announce change of default na_values in fetch_openml

a14b0cd

github-actions bot added the module:datasets label May 25, 2023

DOC an entry in changelog

bf8f5f5

glemaitre mentioned this pull request May 25, 2023

ENH allows to overwrite read_csv parameter in fetch_openml #26433

Merged

iter

e53894b

glemaitre mentioned this pull request Jun 9, 2023

FIX only consider "?" as missing marker as per ARFF specs #26551

Merged

jeremiedbb closed this in #26551 Jun 9, 2023
