DEPR announce change of default na_values in fetch_openml #26436
As discussed IRL, I am a bit worried by issuing a We could parse the CSV twice with the old and new missing value markers to detect if it's worth raising the
Parsing twice is going to be very slow. This particular case only happens with Also, for
Actually, this is an interesting point. The ARFF format forces us to use Theoretically, this could be considered a bug since we don't follow the ARFF specification.
We could still merge #26433 to allow more flexibility if the uploaded ARFF file does not strictly follow the ARFF specification.
Happy to merge the other one, so we close this?
When doing the new PR, I will indicate that it will close this one.
Follow-up of #26433
Make a smooth change of behaviour regarding the default NA values considered when using `fetch_openml` with the pandas parser.

In pandas 2.X, `"None"` is now considered an NA value, so the behaviour would change depending on the installed pandas version. To alleviate the issue, we handle the default NA values ourselves, using the same set of values as pandas 1.X. We go through a smooth deprecation cycle to adopt the pandas 2.X list in scikit-learn 1.5 (in 2 releases).

The warning can be silenced using `read_csv_kwargs`, introduced in #26433, together with the `sklearn.datasets.FUTURE_NA_VALUES` set.

The diff of this PR can be reduced by merging #26433 first, if this is convincing enough to introduce the `read_csv_kwargs` parameter.
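
To illustrate the behaviour change described above, here is a minimal sketch using plain `pandas.read_csv`, independent of `fetch_openml`. It pins the NA markers explicitly with `keep_default_na=False` and `na_values`, so the result does not depend on the installed pandas version. The two marker sets and the CSV content are hypothetical and chosen only for illustration; they are not the actual sets used by pandas or proposed in this PR.

```python
import io

import pandas as pd

# Hypothetical marker sets, for illustration only.
# The "old" set does not treat the string "None" as missing; the "new" one does.
OLD_STYLE_NA = {"", "NA", "NaN", "?"}
NEW_STYLE_NA = OLD_STYLE_NA | {"None"}

# Made-up data in which "None" could be either a category or a missing marker.
CSV_TEXT = "color,size\nred,1\nNone,2\n?,3\n"


def read_with_markers(na_values):
    """Parse CSV_TEXT using only the given strings as NA markers."""
    return pd.read_csv(
        io.StringIO(CSV_TEXT),
        keep_default_na=False,  # ignore the version-dependent pandas defaults
        na_values=sorted(na_values),
    )


old = read_with_markers(OLD_STYLE_NA)
new = read_with_markers(NEW_STYLE_NA)

# With the old-style set, "None" survives as a regular string value.
print(old["color"].isna().tolist())
# With the new-style set, "None" is parsed as a missing value.
print(new["color"].isna().tolist())
```

Passing an explicit marker set like this is what makes the parsing reproducible across pandas versions; the deprecation cycle then only concerns which set is used by default.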