MemoryError in KNNImputer with California housing #15604
Comments
ping @thomasjpfan, any idea? (it is a bit late to debug here right now) |
Maybe your dataset is too big. |
#15615 will save some memory. However, I was wondering if we could have a more memory-efficient implementation. It seems that we compute the pairwise distance matrix between the rows with missing values and the fitted data (scikit-learn/sklearn/impute/_knn.py, lines 228 to 232 in 97958c1).
If …
ping @thomasjpfan @jnothman |
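For a rough sense of scale, here is a back-of-the-envelope estimate of that dense distance matrix, assuming float64 and, in the worst case, a missing value in every row; it lines up with the ~4 GB peak reported later in the thread.

```python
# Rough size of the dense pairwise distance matrix built by KNNImputer:
# one float64 per (row with a missing value, fitted row) pair.
n_samples = 20_640          # California housing
n_rows_with_nan = 20_640    # worst case: every row has at least one NaN
bytes_per_float64 = 8

dist_matrix_bytes = n_rows_with_nan * n_samples * bytes_per_float64
print(f"{dist_matrix_bytes / 1e9:.1f} GB")  # ~3.4 GB for this dataset
```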
I'm not sure what your proposals mean. We only calculate distances for rows with at least one missing value. Yes, we can probably improve memory efficiency by considering one target sample at a time, rather than the column-at-a-time approach here, but I think we are best off seeing how this goes in practice before optimising it too much. I think the problem can be trivially chunked by receiving row. Are you still having trouble with MemoryError? |
Nope. I only see a memory peak of 4 GB for the above example. However, let's close this issue. As @jnothman mentioned, we can wait for feedback to see if we need more optimization using chunking. |
I am getting the same issue on a real-life dataset that is fairly medium in size (i.e., 100k rows and 100 columns). Some of the columns are sparse, so when trying to impute about 20-30 of them, the KNNImputer consumes about 150 GB of memory on an AWS instance and just runs forever without finishing. After some time the memory usage drops without anything happening afterwards. Is there a remedy for this? Both SimpleImputer and IterativeImputer finish very quickly on the same dataset. |
Could you open a new issue mentioning this information? We could then think about improving the imputer. |
Is that good to do given that I can't share the data, so I don't have an example to reproduce the issue? |
Yes, please. Memory usage shouldn't be too dependent on the actual data. If you can produce a smaller synthetic dataset with a comparable amount of sparsity and NAs to impute, that would be helpful. Ideally, a smaller dataset for which KNNImputer would take a few GB, as opposed to 100s of GB in your example, so that we can run it locally on a laptop. |
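For what it's worth, a synthetic dataset along those lines could be sketched as follows; the shape and sparsity level are placeholders, not taken from the actual data, and `n_samples` should be shrunk until the run fits on a laptop.

```python
# Sketch of a synthetic dataset mimicking the reported shape: ~100k rows,
# 100 columns, with ~25 columns that are mostly missing.
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.RandomState(42)
n_samples, n_features = 100_000, 100

X = rng.randn(n_samples, n_features)

# Make 25 columns "sparse" by setting ~80% of their entries to NaN.
sparse_cols = rng.choice(n_features, size=25, replace=False)
for col in sparse_cols:
    X[rng.rand(n_samples) < 0.8, col] = np.nan

# This is the call that reportedly exhausts memory on large inputs.
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X)
```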
Same problem here with 100k rows and 100 features. The problem arises from metrics/pairwise.py. |
Thanks, the fact that we can't apply KNNImputer on 100k samples is indeed problematic. I suppose we can apply chunking in KNNImputer to cap the maximum memory consumption. |
Same problem here with a (59972, 11) shape dataset |
Yes, we can call `pairwise_distances_chunked`. |
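A minimal sketch of that idea, assuming a scikit-learn version where `pairwise_distances_chunked` accepts the nan-aware `"nan_euclidean"` metric that KNNImputer relies on; `working_memory` caps the size of each distance chunk, and the helper name below is hypothetical.

```python
# Sketch: bound peak memory by iterating over chunks of receiver rows instead
# of materialising the full distance matrix at once.
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

def iter_nearest_donors(X_receivers, X_fit, n_neighbors=5, working_memory=64):
    # Yields, for each chunk of receiver rows, the indices and distances of
    # their n_neighbors nearest donors.
    chunks = pairwise_distances_chunked(
        X_receivers,
        X_fit,
        metric="nan_euclidean",         # assumption: nan-aware metric accepted here
        working_memory=working_memory,  # budget per chunk, in MiB
    )
    for dist_chunk in chunks:
        idx = np.argsort(dist_chunk, axis=1)[:, :n_neighbors]
        yield idx, np.take_along_axis(dist_chunk, idx, axis=1)
```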
We can, if we define a method that computes the results for each chunk. |
But the problem is how to define these chunks, since the final result for a specific value depends on data points which may not be in the defined chunk. |
I think the algorithm can be parallelised over receiving rows... so in that sense it should be possible to chunk it. On the other hand, as long as we support arbitrary weighting functions, you can't parallelise or chunk over donors. |
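To make the receiver/donor distinction concrete, here is a rough sketch, not the scikit-learn implementation, that imputes one chunk of receiver rows at a time using a plain mean over the nearest donors; every receiver still sees all donors, which is why arbitrary distance weights would remain valid.

```python
# Sketch: chunk over receivers (rows to impute), never over donors.
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

def impute_column_chunked(X, col, n_neighbors=5, chunk_size=1_000):
    X = X.copy()
    receivers = np.flatnonzero(np.isnan(X[:, col]))
    donors = np.flatnonzero(~np.isnan(X[:, col]))
    for start in range(0, len(receivers), chunk_size):
        rows = receivers[start:start + chunk_size]
        # Distances from this chunk of receivers to *all* donors.
        dist = nan_euclidean_distances(X[rows], X[donors])
        nearest = np.argsort(dist, axis=1)[:, :n_neighbors]
        X[rows, col] = X[donors[nearest], col].mean(axis=1)
    return X
```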
Same problem here. |
I have the same problem. |
@thomasjpfan I didn't see you self-assign this. Maybe even core devs should make use of "take". I think I have a patch. |
Fixes scikit-learn#15604. This is more computationally expensive than the previous implementation, but should reduce memory costs substantially in common use cases.
Same issue here as well, for MDS. |
I was doing a simple example with the California housing dataset and the KNNImputer blew up in my face:
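The original script is not preserved in this thread; a minimal sketch along those lines (the masking fraction and `n_neighbors` are guesses) would be:

```python
# Hypothetical reproduction: mask some values in California housing and impute.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer

X, _ = fetch_california_housing(return_X_y=True)  # ~20,640 rows, 8 features

# Randomly hide ~10% of the entries so that most rows contain a missing value.
rng = np.random.RandomState(0)
X_missing = X.copy()
X_missing[rng.rand(*X.shape) < 0.1] = np.nan

# This triggers the large dense distance computation discussed above.
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_missing)
```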