Avoid np.asarray call in check_array for duck-typed arrays #11447
Why did that PR change the effect on dask? We currently try not to cast memmaps to default arrays, so I suppose things should work for dask too? We should consider explicitly testing our handling of dask. But how would we use duck typing? A problem is that pandas DataFrames are array-like but have different indexing semantics.
Dask is fine; Dask-ML had issues. It looks like the Dask-ML RobustScaler inherits from scikit-learn's and reuses the transform method (I guess it previously only used dask-array-compatible operations). Inside check_array there are a few calls to np.asarray.
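A minimal sketch of the coercion being discussed (assuming dask is installed): check_array funnels its input through np.asarray, so a lazy, chunked dask array comes back as a plain in-memory ndarray.

```python
import dask.array as da
from sklearn.utils import check_array

X = da.ones((1000, 4), chunks=(100, 4))  # lazy, chunked dask array
X_checked = check_array(X)               # np.asarray materializes it
print(type(X_checked))                   # <class 'numpy.ndarray'>
```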
It looks like RobustScaler has a …
Ah, no, I was confused. My mistake.
Means what... what's the status?
We re-implemented the transform and inverse_transform methods in dask-ml's RobustScaler class; previously we shared the implementation. In terms of immediate impact this isn't a big deal. My main objective with this issue is to raise the topic of avoiding the coercion of input to numpy arrays when it isn't strictly necessary.
See also some related discussion at #11043. I do think this is an issue we should be dealing with, and I would be very interested in seeing a PR which explores it. I'd like to see if we can get tests passing out of the box, or whether this needs to be enabled on a per-estimator basis. I'd like to see whether there's a sensible way to duck-type this, or whether we need to make specific exceptions for a while. I don't think you should expect this support for the coming release, unfortunately.
No worries about the coming release; I'm playing a long game here with Dask/SKLearn interactions. :) The improved joblib/dask interaction is something I'm eagerly waiting to see in a scikit-learn release, so I would be sad to hold things up. I can probably find someone to help with the check_array PR; maybe this is a good SciPy sprint task. I'm not sure I understand the estimator_checks comment. There will be many cases where passing through dask arrays definitely won't make sense (any time Cython code assumes the numpy memory model, for example). My guess is that checks applied to all estimators would raise too many special-case problems, but I may be misinterpreting this comment. Also cc'ing @TomAugspurger
Usually we would aim to check that all estimators either behave consistently or raise an informative error. But I suspect we will find ourselves implementing something like accept={'array', 'array-like', 'frame-like'} in check_array...
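Purely as an illustration of that idea, a hypothetical sketch of what an accept-gated check_array might look like; the parameter name and semantics here are invented for illustration, not scikit-learn API:

```python
import numpy as np

def check_array_sketch(X, accept=('array',)):
    """Hypothetical: only coerce when a plain ndarray is required."""
    if 'array-like' in accept and hasattr(X, '__array_function__'):
        # Trust NEP 18 duck arrays (dask, CuPy, ...) without copying.
        return X
    return np.asarray(X)  # current behavior: coerce to numpy
```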
Yes, I think we need to have a general discussion about …
It could be a good start to have a CI build running the tests of …
I was just about to open an issue like this. Thankfully someone beat me to it by a year. 😄 Simply replace Dask with CuPy and dask-ml with cuML to get the issue I'm encountering. I agree it would be great to support array-like things in scikit-learn. What would be our next steps here?
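For concreteness, a sketch of the CuPy flavor of this (requires a GPU and cupy installed); unlike dask, CuPy forbids implicit conversion, so the coercion fails loudly rather than silently copying device memory to host:

```python
import cupy as cp
from sklearn.utils import check_array

X = cp.ones((10, 2))
# CuPy's __array__ raises instead of copying to host memory:
check_array(X)  # TypeError: Implicit conversion to a NumPy array is not allowed
```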
Also see #14702 for us going in the opposite direction. ;) @jakirkham, can you provide an example of estimators that you're interested in?
In the interest of seeing what this would buy us, I made a list of estimators where this could potentially help. The first column is the ones that do not call into Cython or scipy.optimize (though they might use …); the second column calls scipy.optimize and is unlikely to be fixable with NEP 13 and NEP 18; and the third column is definitely impossible without a rewrite, as it makes substantial use of Cython. And "of course" we need to use numpy.linalg whenever we want to support these arrays, not scipy.linalg, meaning we likely have to wrap the common linalg methods to do the dispatch. cc @thomasjpfan
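To illustrate the numpy.linalg point, a sketch assuming a recent dask: numpy's functions dispatch on __array_function__ (NEP 18), so they can hand back the caller's own array type, whereas scipy.linalg coerces to ndarrays.

```python
import numpy as np
import dask.array as da

X = da.random.random((100, 5), chunks=(50, 5))  # tall-and-skinny chunks
u, s, vt = np.linalg.svd(X)  # NEP 18 dispatches to da.linalg.svd
print(type(u))               # dask.array.core.Array -- still lazy
```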
Thanks for putting that list together @amueller! I think what is most interesting to us is what you have called meta-estimators in the spreadsheet (column 4). These are interesting as we can combine them with things like Dask, CuPy, and/or other things where the array type is handled correctly by some estimator we provide. Just to pick one to start out, … Though it is also interesting to see a fair number of estimators fit in column 1, so it's good to hear that those might be usable with other array types. It might be interesting to explore this later on. Does this seem like a reasonable place to start? cc @JohnZed
@jakirkham yes, that sounds good. Sorry for the slow reply.
I think this was done as part of the Array API work in PR #22554. Should we close?
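For reference, the mechanism #22554 introduced looks roughly like this in recent scikit-learn releases (requires the array-api-compat package, and only some estimators, such as LinearDiscriminantAnalysis, support the dispatch):

```python
import numpy as np
import sklearn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.asarray([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.asarray([0, 1, 0, 1])

with sklearn.config_context(array_api_dispatch=True):
    # With dispatch on, array API inputs (e.g. torch tensors) keep
    # their type end to end; plain numpy works as before.
    lda = LinearDiscriminantAnalysis().fit(X, y)
```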
Would it be reasonable to avoid np.asarray calls in check_array if the input is array-like?
My guess is that the general answer is "No. The np.asarray call is important to guarantee consistency within scikit-learn's estimators. We strongly value consistency."
The context here is that after the recently merged #11308, some scikit-learn transformers like RobustScaler that used to work on dask arrays no longer work, because they auto-coerce their inputs into numpy arrays. This likely comes up in other situations as well.
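A minimal sketch of the regression described above (assuming dask is installed and a scikit-learn version after #11308):

```python
import dask.array as da
from sklearn.preprocessing import RobustScaler

X = da.random.random((1000, 3), chunks=(100, 3))
scaler = RobustScaler().fit(X)    # check_array coerces X to numpy here
print(type(scaler.transform(X)))  # <class 'numpy.ndarray'>, no longer dask
```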