import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Small 2d toy problem.
X, y = make_blobs(centers=2, random_state=4, n_samples=30)
knn = KNeighborsClassifier(algorithm='kd_tree').fit(X, y)

# Dense grid over the data range, as you would build to plot a
# decision surface.
x_min, x_max = X[:, 0].min(), X[:, 0].max()
y_min, y_max = X[:, 1].min(), X[:, 1].max()
xx = np.linspace(x_min, x_max, 1000)
# change 100 to 1000 below and wait a long time
yy = np.linspace(y_min, y_max, 100)
X1, X2 = np.meshgrid(xx, yy)
X_grid = np.c_[X1.ravel(), X2.ravel()]
decision_values = knn.predict(X_grid)
This spends all its time in unique within stats.mode, not within the distance calculation: mode runs unique once for every row of the neighbor-label array.
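You can confirm this with a quick profile of the snippet above (assuming knn and X_grid are still in scope); most of the cumulative time should land under stats.mode and unique rather than in the kd-tree query:

import cProfile

# Expect scipy.stats.mode and numpy's unique near the top of the
# cumulative-time column, not the neighbor search itself.
cProfile.run('knn.predict(X_grid)', sort='cumulative')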
I'm pretty sure we can replace the call to mode by building a CSR matrix of per-class vote counts and taking an argmax over it.
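Roughly like the sketch below; mode_via_csr and its signature are just illustrative, and it assumes integer labels in [0, n_classes):

import numpy as np
from scipy import sparse

def mode_via_csr(neigh_labels, n_classes):
    # neigh_labels: (n_samples, n_neighbors) array of integer class labels.
    n_samples, n_neighbors = neigh_labels.shape
    # One entry per (sample, neighbor-label) pair; the CSR constructor
    # sums duplicate coordinates, giving per-class vote counts.
    rows = np.repeat(np.arange(n_samples), n_neighbors)
    votes = sparse.csr_matrix(
        (np.ones(rows.shape[0]), (rows, neigh_labels.ravel())),
        shape=(n_samples, n_classes),
    )
    # The predicted class for each row is its most frequent neighbor label.
    return np.asarray(votes.argmax(axis=1)).ravel()

Here neigh_labels would be y[knn.kneighbors(X_grid, return_distance=False)], i.e. the labels of each query point's neighbors. Each label is touched only once, though tie-breaking might differ from stats.mode and would need checking.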
How much is this worth optimizing? I feel that KNN should be fast in low dimensions, and people might actually use this. Having the bottleneck in the wrong place just feels wrong to me ;)