Fix array api in mean_absolute_percentage_error for older versions #29490
@lesteve @adrinjalali I did fix the error resulting from the |
Indeed no idea what is happening but https://app.circleci.com/pipelines/github/scikit-learn/scikit-learn/59983/workflows/84052e1d-6041-4a3d-89cf-fb076b09ba99/jobs/280390?invite=true#step-105-86409_95 is showing a warning about the scoring failing with a similar stack-trace as before. So I am guessing that the underlying issue is still there?
Yes and the error is now showing in functions that were updated before. Do we want to keep the numpy version or can we update it? Otherwise we will need to add some kind of a check in |
The way I understand the "min" in |
I think then we would need to handle this special case in |
Do you have a small code snippet that reproduces the problem? I am trying to understand why we are calling array API code from an example. Naively I was expecting that the examples don't use the array API and hence we shouldn't touch any of the related code. |
I think to reproduce this we would need to downgrade a number of libraries like numpy, polars which are used in this example. So as @lesteve mentioned follow the instructions in quick doc to set up an environment. After that we can simply run the example. I don't have a smaller snippet because I haven't been able to reproduce this even with the full example on my mac 😄 |
Here is a snippet that reproduces the issue for me (see the versions in the doc-min-dependencies environment.yml, amongst other things). This one reproduces the issue in main but not in this PR:
import polars as pl
s = pl.Series([1, 2, 3])
from sklearn.metrics import mean_absolute_percentage_error
mean_absolute_percentage_error(s, s)
I need to look more into reproducing the warnings, it looks like this is happening in |
Thanks for the example. I also struggle to get it to reproduce, but mostly because of difficulty with getting an environment setup with the right versions (I suspect). Looking around the code base for uses of scikit-learn/sklearn/metrics/pairwise.py Line 167 in d79cb58
sklearn/metrics/_regression.py where it is used without such a gate. Loic, could you try your reproducer with r2_score(s, s)? Maybe the problem exists in all of those and we just haven't noticed because they aren't used in a context where a polars series is passed in (I think that is another necessary ingredient for the reproducer).
Basically the series dtype Int64 seems to be throwing an error when we check this line scikit-learn/sklearn/utils/_array_api.py Lines 214 to 215 in d79cb58
But this works fine for the newer versions as we are just checking whether it matches with the float dtypes. |
Just curious, I guess the difficulty is that for arm64 macOS it's hard to get older versions, because they have not been built in conda-forge? Looking a bit further. I think this PR actually fixes the issue for I think a test should be added in As asked by Tim, I tweaked the example (see tweaked version) and it seems that other metrics are still problematic (
@lesteve Thank you for reporting. I think the fix in this PR might not be suitable generally. I think we need to handle the error occurring within the functions that cause this error. We cannot actually replace |
I think in a sense this is similar to #29452, as in, while array api dispatch is not enabled, we see side effects. I think we should make sure there is no side effects when dispatch is not enabled. |
For completeness, it seems like numpy 1.20.3 is the latest version that has the issue. Still, I share Adrin's concern about the fact that the coverage needs to be increased because it can break in mysterious ways ... |
I have removed the warning because it seems unnecessary. |
For completeness still, a snippet that shows the issue with numpy 1.20.3 and not 1.21.0 (polars version does not matter):
import numpy as np
import polars as pl
s = pl.Series([1, 2, 3])
numpy_float64_dtype = np.dtype(np.float64)
# False
numpy_float64_dtype == pl.Int64
# False
s.dtype == numpy_float64_dtype
# TypeError: Cannot interpret 'Int64' as a data type
numpy_float64_dtype == s.dtype
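The snippet above depends on operand order: Python calls the left operand's `__eq__` first, so the result changes when the two sides are swapped. The sketch below illustrates that mechanism with standard-library code only. `StrictDType` and `ForeignDType` are hypothetical stand-ins invented for this illustration (they are not numpy or polars classes), and the error message is copied from the traceback quoted above.

```python
class StrictDType:
    """Hypothetical stand-in for a dtype whose == rejects foreign objects."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if not isinstance(other, StrictDType):
            # Mimics the "Cannot interpret ... as a data type" failure.
            raise TypeError(f"Cannot interpret {other!r} as a data type")
        return self.name == other.name

    def __hash__(self):
        return hash(self.name)


class ForeignDType:
    """Hypothetical stand-in for a third-party dtype that tolerates anything."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return repr(self.name)

    def __eq__(self, other):
        if not isinstance(other, ForeignDType):
            return False
        return self.name == other.name

    def __hash__(self):
        return hash(self.name)


float64 = StrictDType("float64")
int64 = ForeignDType("Int64")

# Left operand is the foreign dtype: its __eq__ runs first and returns False.
left_is_foreign = int64 == float64

# Left operand is the strict dtype: its __eq__ runs first and raises.
try:
    float64 == int64
    left_is_strict = "no error"
except TypeError as exc:
    left_is_strict = str(exc)
```

Here `left_is_foreign` is `False`, while `left_is_strict` holds the `TypeError` message, mirroring the two orderings in the reproducer.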
I added a small common test to check metric on pandas and polars series. I guess there is probably some room for improvement. This fails locally on The most wonderful thing about it (this is sarcasm just to be clear 😉) is that this is not going to make the CI red because it seems like we don't even have a build for numpy 1.19 or numpy 1.20 ... and even if we had one polars or pandas would not be installed anyway so 😓 ... I opened #29502 to actually use our minimum supported version in our min-dependencies-build and add pandas and polars into it. |
I think we might not be able to fix the codecov issue considering that the TypeError is not raised in the general case. |
I am going to approve this but it would be great if someone looked at the test I added because right now, it looks quite minimal.
In particular maybe there is a better way to know which metrics support 1d input than using a try/except?
@@ -439,7 +439,15 @@ def reshape(self, x, shape, *, copy=None):
        return numpy.reshape(x, shape)

    def isdtype(self, dtype, kind):
        return isdtype(dtype, kind, xp=self)
        try:
Thinking a bit more about it, we may as well put the try/except closer to where the problem happens, namely _isdtype_single, i.e. have something like this:
def _isdtype_single(dtype, kind, *, xp):
    try:
        ...  # all the current code of _isdtype_single
    except TypeError:
        return False
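As a rough illustration of the guard suggested above, the sketch below wraps a dtype comparison in try/except so that an object that cannot be compared is simply reported as "not a float dtype". It is not the scikit-learn implementation: `StrictFloatDType`, `FLOAT_DTYPES`, and `is_real_floating` are hypothetical names made up for this example, and the strict `__eq__` only imitates the failure mode discussed in this thread.

```python
class StrictFloatDType:
    """Hypothetical dtype whose == raises TypeError on unknown objects."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if not isinstance(other, StrictFloatDType):
            raise TypeError(f"Cannot interpret {other!r} as a data type")
        return self.name == other.name


# Hypothetical list of the dtypes that count as "real floating".
FLOAT_DTYPES = (StrictFloatDType("float32"), StrictFloatDType("float64"))


def is_real_floating(dtype, float_dtypes=FLOAT_DTYPES):
    """Return True if dtype equals one of float_dtypes, else False."""
    try:
        # The known dtype is deliberately the left operand, which is the
        # ordering that can raise.
        return any(candidate == dtype for candidate in float_dtypes)
    except TypeError:
        # A dtype that cannot even be compared is not a float dtype.
        return False
```

For example, `is_real_floating(StrictFloatDType("float64"))` returns `True`, whereas `is_real_floating("Int64")` returns `False` instead of propagating the `TypeError`. One trade-off of this pattern is that it also hides unrelated `TypeError`s raised inside the guarded block.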
Wouldn't it be better to keep it in the NumpyArrayWrapper because that is what is used in the default case and when array api dispatch is not enabled?
Short answer: I know a lot less than you about the array API code so I would trust you on this 😉
Naively I thought the problem would happen with _isdtype_single(dtype, "real floating", xp=np) but it doesn't, because np.float32 is not a dtype whereas xp.float32 is a dtype (i.e. np.dtype(np.float32)); not sure if this is a very naive assumption on my part ...
In other words with numpy < 1.21 here is the behaviour
from sklearn.utils._array_api import _isdtype_single
from sklearn.utils._array_api import _NumPyAPIWrapper
import numpy as np
import polars as pl
dtype = pl.Series([1, 2, 3]).dtype
# no issue because dtype is compared to np.float32 which is not a dtype
# I would have naively expected an error since I thought the comparison
# would be with np.dtype(np.float32)
_isdtype_single(dtype, "real floating", xp=np)
xp = _NumPyAPIWrapper()
# issue because under the hood dtype == np.float32 happens, which is an error
# TypeError: Cannot interpret 'Int64' as a data type
_isdtype_single(dtype, "real floating", xp=xp)
I understand your point, and to handle all scenarios I think it would make sense to catch the exception in _isdtype_single. However, I don't think we would want to use this function directly from the code. For any case other than numpy, since we would be using other array API libraries, I think this would break anyway because series objects would probably not be compatible with other array types.
I think the tests look fine since we are just checking that polars and pandas work without any errors. |
@adrinjalali @betatim Could you have a look at this PR, so that we can resolve the CI failures occurring? |
Otherwise LGTM.
I enabled auto-merge, let's make doc-min-dependencies gr |
Reference Issues/PRs
Follow up of #29300
What does this implement/fix? Explain your changes.
Any other comments?