DEPR announce change of default na_values in fetch_openml #26436

Closed

glemaitre
Member

@glemaitre glemaitre commented May 25, 2023

Follow-up of #26433

Make a smooth change of behaviour regarding the default NA values to be considered when using fetch_openml with the Pandas parser.

In pandas 2.X, "None" is now treated as an NA value, so the behaviour would change depending on the installed pandas version. To alleviate the issue, we handle the default NA values ourselves, using the same set of values as pandas 1.X. We go through a smooth deprecation cycle and adopt the pandas 2.X list in scikit-learn 1.5 (in 2 releases).

The warning can be silenced using read_csv_kwargs introduced in #26433 and the sklearn.datasets.FUTURE_NA_VALUES set.
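
To make the version dependence concrete, here is a minimal sketch (not code from this PR). `LEGACY_NA_VALUES` and `FUTURE_NA_VALUES` are local stand-ins that are assumed to match the pandas 1.X and 2.X `read_csv` defaults; the `sklearn.datasets.FUTURE_NA_VALUES` set named above is only proposed here, so it is not imported. `explicit_na_kwargs`, `read_labels` and `fetch_with_explicit_na` are hypothetical helpers, and the last one assumes that `read_csv_kwargs` is forwarded to `pandas.read_csv` as in #26433.

```python
import io

import pandas as pd

# Assumption: these mirror the read_csv default NA markers of pandas 1.X.
LEGACY_NA_VALUES = {
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan",
    "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "n/a", "nan",
    "null",
}
# Assumption: pandas 2.X adds "None" to that list.
FUTURE_NA_VALUES = LEGACY_NA_VALUES | {"None"}


def explicit_na_kwargs(na_values):
    """Build read_csv keyword arguments that do not depend on pandas defaults."""
    return {"na_values": sorted(na_values), "keep_default_na": False}


def read_labels(csv_text, na_values):
    """Parse csv_text and return the missing-value mask of its 'label' column."""
    frame = pd.read_csv(io.StringIO(csv_text), **explicit_na_kwargs(na_values))
    return frame["label"].isna().tolist()


def fetch_with_explicit_na(name, na_values):
    """Hypothetical helper: forward explicit NA markers to fetch_openml."""
    # Imported lazily: needs scikit-learn and network access.
    from sklearn.datasets import fetch_openml

    return fetch_openml(
        name,
        as_frame=True,
        parser="pandas",
        read_csv_kwargs=explicit_na_kwargs(na_values),
    )


CSV_TEXT = "x,label\n1,a\n2,None\n3,NA\n"
legacy_mask = read_labels(CSV_TEXT, LEGACY_NA_VALUES)  # "None" stays a string
future_mask = read_labels(CSV_TEXT, FUTURE_NA_VALUES)  # "None" becomes missing
```

Passing both `na_values` and `keep_default_na=False` is what makes the result independent of the installed pandas version. When overriding the markers for an ARFF-backed dataset, the ARFF missing marker "?" probably needs to be included in the set as well.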

The diff of this PR can be reduced by merging #26433 first, if the current state is convincing enough to introduce the read_csv_kwargs parameter.

@ogrisel
Member

ogrisel commented May 30, 2023

As discussed IRL, I am a bit worried about issuing a FutureWarning whenever we call fetch_openml with the default arguments, even when there are no missing values with either the old or the new pandas.

We could parse the CSV twice, once with the old and once with the new missing value markers, to detect whether it's worth raising the FutureWarning. It's 2x the cost, but maybe that's not so bad: of course we would not do it when the user is explicit about the missing value markers. Furthermore, it's a transient overhead: after 2 scikit-learn releases we would be back to parsing the data only once (with the new default missing value markers).
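
The double-parse idea could look roughly like the sketch below. It is an illustration only: `OLD_MARKERS` and `NEW_MARKERS` are shortened stand-ins for the real marker lists, and `markers_change_result` and `warn_if_affected` are hypothetical names, not functions from scikit-learn.

```python
import io
import warnings

import pandas as pd

# Illustrative marker sets; the real default lists are longer.
OLD_MARKERS = ["?", "NA", "nan"]
NEW_MARKERS = OLD_MARKERS + ["None"]


def markers_change_result(csv_text, old_markers=OLD_MARKERS, new_markers=NEW_MARKERS):
    """Parse csv_text twice and report whether the missing-value masks differ."""

    def missing_mask(markers):
        frame = pd.read_csv(
            io.StringIO(csv_text), na_values=markers, keep_default_na=False
        )
        return frame.isna()

    return not missing_mask(old_markers).equals(missing_mask(new_markers))


def warn_if_affected(csv_text):
    """Emit a FutureWarning only when the new markers change the parsed data."""
    affected = markers_change_result(csv_text)
    if affected:
        warnings.warn(
            "The default NA markers will change; 'None' will be read as missing.",
            FutureWarning,
        )
    return affected
```

With this shape, a dataset without any "None" entry pays the cost of the second parse but never triggers the warning.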

@adrinjalali
Member

Parsing twice is going to be very slow. This particular case only happens with "None" as a string. Maybe we could check whether it exists and only warn in those cases?

Also, for fetch_openml, I hope there aren't many "None" values used to indicate missing values, so we could also assume they don't exist and set the parameter so that we use the old pandas behavior here.
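
The cheaper check suggested above, warning only when a "None" entry actually occurs, could be sketched with the standard library alone. `contains_none_token` is a hypothetical helper, and treating only fields that equal the token exactly (after CSV unquoting) as affected is an assumption of this sketch.

```python
import csv
import io


def contains_none_token(csv_text, token="None"):
    """Return True if any field of csv_text is exactly `token`."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return any(field == token for row in reader for field in row)
```

This is a single pass over the text without building a dataframe, so it should cost much less than a second full parse.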

@glemaitre
Member Author

glemaitre commented Jun 1, 2023

> Also, for fetch_openml, I hope there aren't many "None" values used to indicate missing values, so we could also assume they don't exist and set the parameter so that we use the old pandas behavior here.

Actually, this is an interesting point. The ARFF format forces us to use "?" (cf. https://www.cs.waikato.ac.nz/~ml/weka/arff.html) as the missing marker. I forgot to set keep_default_na=False when making the parser, meaning that we set additional markers for missing values.

Theoretically, this could be considered a bug since we don't follow the ARFF specification.
We could simply set keep_default_na=False and make it a bug fix. It might introduce some regressions as well.
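
A small sketch of the difference, assuming the data section of an ARFF file is read with `pandas.read_csv`. `ARFF_DATA`, the column names and `parse_arff_data` are made up for illustration and are not the scikit-learn parser.

```python
import io

import pandas as pd

# Data section of a hypothetical ARFF file: "?" marks a missing value,
# while "NA" is a legitimate nominal category.
ARFF_DATA = "1,NA\n2,?\n3,low\n"


def parse_arff_data(text, strict=True):
    """Parse ARFF data rows; with strict=True only "?" is a missing marker."""
    return pd.read_csv(
        io.StringIO(text),
        header=None,
        names=["x", "level"],
        na_values=["?"],
        # keep_default_na=True would add the pandas default markers on top of "?".
        keep_default_na=not strict,
    )


strict_mask = parse_arff_data(ARFF_DATA, strict=True)["level"].isna().tolist()
lenient_mask = parse_arff_data(ARFF_DATA, strict=False)["level"].isna().tolist()
```

In strict mode the "NA" category survives as a string, whereas keeping the pandas defaults turns it into a missing value. That difference is also where the possible regressions would come from.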

@glemaitre
Member Author

We could still merge #26433 to allow more flexibility when an uploaded ARFF file does not strictly follow the ARFF specs.

@adrinjalali
Member

Happy to merge the other one, so we close this?

@glemaitre
Member Author

When doing the new PR, I will indicate that it will close this one.
