Describe the bug
BallTree and KDTree have a method query_radius that returns, for each query point, the indices of all points within a given radius.
I have noticed that querying point by point in a loop causes no memory problems, but querying the entire array at once saturates the memory. I am not sure whether this is a memory leak or whether the call needs some form of optimization.
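A back-of-the-envelope estimate suggests the batch result may simply not fit in RAM, rather than leaking. This is a sketch under my own assumptions (points uniform in the unit square, one int64 index array per query, N and r as in the reproduction below):

```python
import numpy as np

N = 60000          # number of points, as in the reproduction below
r = 0.5            # query radius, as in the reproduction below

# Monte Carlo estimate of the average fraction of points that fall within
# radius r of a random query point (points uniform in the unit square).
rng = np.random.default_rng(0)
pts = rng.random((5000, 2))
queries = rng.random((200, 2))
dists = np.linalg.norm(pts[None, :, :] - queries[:, None, :], axis=2)
frac = (dists < r).mean()

# The batch query returns one int64 index array per query point, so the
# full result holds roughly N * (frac * N) indices at 8 bytes each.
est_bytes = N * (frac * N) * 8
print('average fraction within r: %.2f' % frac)
print('estimated batch result size: %.1f GB' % (est_bytes / 1e9))
```

With these numbers the fraction comes out just under one half, which puts the batch result in the 13-14 GB range, consistent with the out-of-memory failure reported below.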
Steps/Code to Reproduce
Example:
import gc
import sklearn
from sklearn.neighbors import NearestNeighbors
import numpy as np
from memory_profiler import memory_usage
import psutil
def get_memory_usage():
    return memory_usage(timeout=0.2)[0]
vmem = psutil.virtual_memory()
print('sklearn:', sklearn.__version__)
print('Initial memory usage: %.2f MB' % get_memory_usage())
print('RAM available: %.2f GB' % (vmem.available/1e9))
N = 60000
X = np.random.rand(N, 2) # Random data
print('Allocated X: %.2f MB' % get_memory_usage())
print('X.nbytes == %.2f MB' % (X.nbytes / 1e6))
print('X min %.2f' % X.min())
print('X max %.2f' % X.max())
Now let's try to find all neighbours:
# build the tree, then query all elements at once - this exhausts memory
# (the tree construction was missing from the original snippet)
from sklearn.neighbors import BallTree
tree = BallTree(X)
ind = tree.query_radius(X, r=0.5)
That will exhaust all your memory; increase or decrease N for testing.
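Two possible workarounds, sketched here with a smaller N than the report's 60000 to keep the demo quick: query_radius accepts count_only=True, which returns only the neighbour counts without materialising any index arrays, and the full queries can be chunked so that only a bounded number of result arrays are alive at a time.

```python
import numpy as np
from sklearn.neighbors import BallTree

N = 5000  # smaller than the report's 60000 so this sketch runs quickly
X = np.random.rand(N, 2)
tree = BallTree(X)

# Option 1: counts only - no index arrays are allocated at all.
counts = tree.query_radius(X, r=0.5, count_only=True)

# Option 2: chunked queries - at most `chunk` index arrays alive at once.
chunk = 1000
chunk_counts = np.empty(N, dtype=np.int64)
for start in range(0, N, chunk):
    ind = tree.query_radius(X[start:start + chunk], r=0.5)
    chunk_counts[start:start + chunk] = [a.shape[0] for a in ind]

assert np.array_equal(counts, chunk_counts)
print('max neighbours per point:', counts.max())
```

Neither option avoids the underlying cost if you genuinely need all index arrays at once, but both keep peak memory bounded.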
Expected Results
If I query via iteration it all works well:
# query all elements one by one
counts = []
for i in range(X.shape[0]):
    ind = tree.query_radius(X[i:i+1], r=0.5)
    found = ind[0].shape[0]
    counts.append(found)
    assert found > 0
print('Allocated with tree: %.2f MB' % get_memory_usage())
This takes about 21 seconds on Colab and uses around 10 MB of additional memory.
Actual Results
Out of memory on a Colab setup with 13 GB of RAM. Please adjust the N parameter in case you have a larger setup.
Versions
System:
python: 3.7.11 (default, Jul 3 2021, 18:01:19) [GCC 7.5.0]
executable: /usr/bin/python3
machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
Python dependencies:
pip: 21.1.3
setuptools: 57.2.0
sklearn: 0.22.2.post1
numpy: 1.19.5
scipy: 1.4.1
Cython: 0.29.23
pandas: 1.1.5
matplotlib: 3.2.2
joblib: 1.0.1
Built with OpenMP: True