[WIP] Add sparse-rbf kernel option for semi_supervised.LabelSpreading #15922
Reference Issues/PRs
This pull request adds another named kernel option, sparse-rbf, for semi_supervised.LabelSpreading. Loosely related to #15868.
Rationale
This kernel provides RBF weights for only the k-nearest-neighbors of each point.
Currently, the only kernel options are knn and rbf, but for a large dataset the dense rbf kernel is not feasible (for N items, we must compute a dense N x N matrix). Since computing the dense kernel matrix is infeasible for any non-trivial dataset, my purpose here is to provide a kernel that can perform better than knn while still being feasible for a large dataset.
The intuition is that a weighted adjacency matrix gives more information about the graph structure of our dataset, and therefore, with appropriate parameter tuning, such a kernel should perform better than a binary adjacency matrix. Furthermore, filling in the RBF weights is cheap once we have found the k nearest neighbors, so the additional runtime cost is minimal.
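For illustration, here is a minimal sketch of that idea built from existing scikit-learn pieces; the sparse_rbf_affinity name and the default parameter values are just for this example, not the PR's implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def sparse_rbf_affinity(X, n_neighbors=7, gamma=20.0):
    # kneighbors_graph with mode="distance" returns a CSR matrix that stores
    # only each point's k nearest-neighbor distances, so the dense N x N
    # kernel is never materialized.
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
    graph = nn.kneighbors_graph(mode="distance")
    # Replace each stored distance d with the RBF weight exp(-gamma * d**2).
    # This touches only the O(N * k) stored entries, so the extra cost on top
    # of the neighbor search is negligible.
    graph.data = np.exp(-gamma * graph.data**2)
    return graph
```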
In particular, I believe the performance difference will be clearer for datasets with more "difficult" structure; I therefore included CIFAR10 as an option for the example script (the easiest way I know to obtain it is via https://pytorch.org/docs/stable/torchvision/datasets.html#cifar), though I have not yet had time to experiment with a significant fraction of that dataset.
It might also be useful to try this on some toy datasets like https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html, but I have not experimented with this yet.
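As a rough sketch of what such an experiment could look like today (using the existing knn kernel, since kernel="sparse-rbf" is the option this PR would add and is not available in released scikit-learn):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=500, noise=0.1, random_state=0)

# Hide most of the labels: LabelSpreading treats -1 as "unlabeled".
rng = np.random.RandomState(0)
y_partial = np.full_like(y, -1)
labeled = rng.choice(len(y), size=10, replace=False)
y_partial[labeled] = y[labeled]

# With this PR, kernel="sparse-rbf" could be compared against kernel="knn".
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)
print("transductive accuracy:", (model.transduction_ == y).mean())
```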
Notes and Questions
I included tests to show that the entries of the sparse-rbf kernel are indeed the same as the top-k entries of the kernel that you get using the dense rbf option.
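As a rough illustration of the kind of check I mean (using the hypothetical sparse_rbf_affinity helper sketched in the Rationale section above, rather than the PR's actual kernel code):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_blobs(n_samples=200, random_state=0)
gamma, k = 20.0, 7

sparse = sparse_rbf_affinity(X, n_neighbors=k, gamma=gamma).toarray()
dense = rbf_kernel(X, gamma=gamma)

# For each row, the nonzero entries of the sparse kernel should match the
# k largest off-diagonal entries of the dense RBF kernel.
np.fill_diagonal(dense, 0.0)
for i in range(len(X)):
    top_k = np.sort(dense[i])[-k:]
    assert np.allclose(np.sort(sparse[i][sparse[i] > 0]), top_k)
```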
Since I am making a claim/intuition about performance, I also included an example that does a hyperparameter grid search for the two available sparse kernels (knn, already provided, and the new sparse-rbf), and then uses the optimal parameters across a range of % supervision to compare.
I'm opening this PR to get some feedback on the following items:
Caveats about the hyperparameter search
The grid search relies on GridSearchCV, so candidates are scored by inductive accuracy on held-out folds. It might be possible (with some headache) to run the grid search using transductive accuracy instead, but this also might not matter.
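For concreteness, here is a minimal sketch of that kind of GridSearchCV setup (not the PR's example script; the digits dataset and the parameter grid are just placeholders), run on labeled data only so that the standard inductive scoring applies:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.semi_supervised import LabelSpreading

X, y = load_digits(return_X_y=True)

param_grid = {
    "n_neighbors": [5, 10, 20],
    "alpha": [0.1, 0.2, 0.5],
    # With this PR, one could also search "kernel": ["knn", "sparse-rbf"]
    # and "gamma" for the RBF weighting.
}

# GridSearchCV scores each candidate on held-out labeled folds, i.e. by
# inductive accuracy, which is exactly the caveat described above.
search = GridSearchCV(LabelSpreading(kernel="knn"), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```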
Example results
Here are the results of using 20% of MNIST, with the fixed parameters included at the bottom of the example script:

Please let me know if you need more info.
Thanks!