[WIP] Add sparse-rbf kernel option for semi_supervised.LabelSpreading #15922
Reference Issues/PRs
This pull request adds another named kernel option, sparse-rbf, for semi_supervised.LabelSpreading. Loosely related to #15868.
Rationale
This kernel provides RBF weights for only the k-nearest-neighbors of each point.
Currently, the only kernel options are knn and rbf, but for a large dataset the dense rbf kernel is not feasible (for N items, we must compute a dense N x N matrix). Since computing the dense kernel matrix is infeasible for any non-trivial dataset, my purpose here is to provide a kernel that can perform better than knn while still being feasible for a large dataset.
The intuition is that a weighted adjacency matrix gives more information about the graph structure of our dataset, and therefore, with appropriate parameter tuning, such a kernel should perform better than a binary adjacency matrix. Furthermore, filling in the RBF weights is cheap once we have found the k nearest neighbors, so the additional runtime cost is minimal.
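For illustration, here is a minimal sketch of that idea built from existing scikit-learn pieces; the sparse_rbf_affinity name and the default parameter values are just for this example, not the PR's implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def sparse_rbf_affinity(X, n_neighbors=7, gamma=20.0):
    # kneighbors_graph with mode="distance" returns a CSR matrix that stores
    # only each point's k nearest-neighbor distances, so the dense N x N
    # kernel is never materialized.
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
    graph = nn.kneighbors_graph(mode="distance")
    # Replace each stored distance d with the RBF weight exp(-gamma * d**2).
    # This touches only the O(N * k) stored entries, so the extra cost on top
    # of the neighbor search is negligible.
    graph.data = np.exp(-gamma * graph.data**2)
    return graph
```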
In particular, I believe the performance difference will be clearer for datasets with more "difficult" structure; I therefore included CIFAR10 as an option for the example script (the easiest way I know to obtain it is via https://pytorch.org/docs/stable/torchvision/datasets.html#cifar), though I have not yet had time to experiment with a significant fraction of that dataset.
It might also be useful to try this on some toy datasets like https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html, but I have not experimented with this yet.
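As a rough sketch of what such an experiment could look like today (using the existing knn kernel, since kernel="sparse-rbf" is the option this PR would add and is not available in released scikit-learn):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=500, noise=0.1, random_state=0)

# Hide most of the labels: LabelSpreading treats -1 as "unlabeled".
rng = np.random.RandomState(0)
y_partial = np.full_like(y, -1)
labeled = rng.choice(len(y), size=10, replace=False)
y_partial[labeled] = y[labeled]

# With this PR, kernel="sparse-rbf" could be compared against kernel="knn".
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)
print("transductive accuracy:", (model.transduction_ == y).mean())
```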
Notes and Questions
I included tests to show that the entries of the sparse-rbf kernel are indeed the same as the top-k entries of the kernel that you get using the dense rbf option.
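As a rough illustration of the kind of check I mean (using the hypothetical sparse_rbf_affinity helper sketched in the Rationale section above, rather than the PR's actual kernel code):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_blobs(n_samples=200, random_state=0)
gamma, k = 20.0, 7

sparse = sparse_rbf_affinity(X, n_neighbors=k, gamma=gamma).toarray()
dense = rbf_kernel(X, gamma=gamma)

# For each row, the nonzero entries of the sparse kernel should match the
# k largest off-diagonal entries of the dense RBF kernel.
np.fill_diagonal(dense, 0.0)
for i in range(len(X)):
    top_k = np.sort(dense[i])[-k:]
    assert np.allclose(np.sort(sparse[i][sparse[i] > 0]), top_k)
```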
Since I am making a claim/intuition about performance, I also included an example that does a hyperparameter grid search for the two available sparse kernels (knn, already provided, and the new sparse-rbf), and then uses the optimal parameters across a range of % supervision to compare.
I'm opening this PR to get some feedback on the following items:
Caveats about the hyperparameter search
The grid search relies on GridSearchCV, so candidates are scored by inductive accuracy on held-out folds. It might be possible (with some headache) to run the grid search using transductive accuracy instead, but this also might not matter.
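For concreteness, here is a minimal sketch of that kind of GridSearchCV setup (not the PR's example script; the digits dataset and the parameter grid are just placeholders), run on labeled data only so that the standard inductive scoring applies:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.semi_supervised import LabelSpreading

X, y = load_digits(return_X_y=True)

param_grid = {
    "n_neighbors": [5, 10, 20],
    "alpha": [0.1, 0.2, 0.5],
    # With this PR, one could also search "kernel": ["knn", "sparse-rbf"]
    # and "gamma" for the RBF weighting.
}

# GridSearchCV scores each candidate on held-out labeled folds, i.e. by
# inductive accuracy, which is exactly the caveat described above.
search = GridSearchCV(LabelSpreading(kernel="knn"), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```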
Example results
Here are the results of using 20% of MNIST, with the fixed parameters included at the bottom of the example script:

Please let me know if you need more info.
Thanks!