Add content-based recommendation-system for the example gallery #1081

ArturoAmorQ · 2023-02-15T10:59:20Z

In the example gallery of scikit-learn we (more or less) follow a logic of grouping examples by module, e.g. the clustering section groups examples concerning the sklearn.cluster module.

We could try to further divide the example gallery by classes and functions of a given module using subsections to help users focus on a given algorithm, e.g. distinguish between examples using sklearn.cluster.KMeans and, say, examples using sklearn.cluster.MiniBatchKMeans. This would be similar to the already existing structure (as shown by the former links) but would introduce redundancies, as a given example would belong to several subsections from several modules.

Instead we could have a recommender system based on similarity to automatically link to the most relevant related content. This could be introduced at the end of each example (see screenshot below). Maybe a nearest neighbors tf-idf of the symbols could do the job.

I believe that other libraries may benefit from such a feature, such as the matplotlib example gallery. Thoughts on this?

cc @jklymak @GaelVaroquaux

larsoner · 2023-02-15T14:19:54Z

My quick, not-thought-out idea would be to combine tagging / #261 with some automated aggregation/grouping system. That way when you add a new example/tutorial (or retrospectively want to classify one) you just add the appropriate labels, and sphinx-gallery (or whatever) automatically adds it in the right places

GaelVaroquaux · 2023-02-15T14:51:01Z

> combine tagging / #261 with some automated aggregation/grouping system.

Right: listing the examples with the same tags. That's a good idea, though the "backend" differs. What can be in common is the "frontend" (the html + css code to display the lists).

GaelVaroquaux · 2023-02-15T14:52:06Z

Also, I forgot to say: I'm super enthusiastic about the idea, as it will enable us to cross-link examples without having lists to maintain.

jklymak · 2023-02-15T16:28:27Z

Matplotlib uses https://sphinx-gallery.github.io/dev/configuration.html#add-mini-galleries-for-api-documentation for the API back references. This seems a similar idea, and perhaps the same "see also" list? For our gallery entries we've been back referencing the api (see the bottom of https://matplotlib.org/stable/gallery/images_contours_and_fields/contour_demo.html, for instance), but this is done manually, so is not consistent across the library, though improving slowly.

ping @melissawm and @story645 who are perhaps getting a GSoD mentee to help with tagging the MPL examples.

story645 · 2023-02-15T16:37:42Z

Yeah I opened melissawm/sphinx-tags#33 on sphinx-tags for auto tagging on API - I think that should be doable on scraping.

@kolibril13 implemented some really dynamic search filtering in https://github.com/kolibril13/plywood-gallery and I think it'd be really useful to also have that in sphinx gallery (especially if you want to combine w/ recommendations), possibly integrated w/ the tags as the initial auto-fills.

melissawm · 2023-02-16T12:36:38Z

I think this is a great idea and would be happy to help see it through!

This would mean an optional dependency on scikit-learn, from what I understand, for the

> Maybe a nearest neighbors tf-idf of the symbols could do the job.

Because the tags are directives, they should be pretty easy to autopopulate once the clusters and classification are identified by the algorithm. Would this be a PR to sphinx-gallery or to sphinx-tags?

jklymak · 2023-02-16T13:26:51Z

Ha, I missed that you were going to try to automate this somehow. That sounds like a research project first.

melissawm · 2023-02-16T13:33:45Z

I think these can be two different approaches: have a human classify the gallery if you can (probably the best option imo!), but have an automated clustering option if you prefer.

ArturoAmorQ · 2023-02-16T14:01:02Z

I do prefer something automated on sphinx-gallery that can be tuned to show, for instance, the 5 most relevant examples (5 nearest neighbors). Then human-implemented tags can be more flexible about the number of examples assigned to a cluster and the criteria for them.

larsoner · 2023-02-16T17:04:25Z

> I do prefer something automated on sphinx-gallery and that can be tuned to show, for instance, the 5 most relevant examples (5 nearest neighbors). Then human implemented tags can be more flexible about the number of examples assigned to a cluster and criteria for them.

So I think there are two separate issues:

1. How examples are labeled in some way or considered similar to one another
2. Which examples to recommend at the end of each example

I think (2) probably is in scope for SG. Thinking about the manual sphinx-tags case for (1), I think it's straightforward enough to include the N most similar examples in terms of tags or whatever using some suitable distance-based algorithm for (2).
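As a rough illustration of the "N most similar examples in terms of tags" idea, here is a minimal sketch that uses Jaccard similarity between tag sets. The choice of Jaccard, the function names, and the example file names and tags are all assumptions made for illustration; nothing here is taken from sphinx-gallery or sphinx-tags.

```python
def jaccard(a, b):
    """Jaccard similarity of two tag sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar_by_tags(tags, n=5):
    """Map each example to its n most similar other examples by tag overlap."""
    recommendations = {}
    for name, own_tags in tags.items():
        scored = [
            (jaccard(own_tags, other_tags), other)
            for other, other_tags in tags.items()
            if other != name
        ]
        # Highest score first; ties broken alphabetically for stable output.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        # Drop candidates that share no tag at all.
        recommendations[name] = [other for score, other in scored[:n] if score > 0]
    return recommendations


# Hypothetical examples and tags, for illustration only.
example_tags = {
    "plot_kmeans.py": {"clustering", "kmeans"},
    "plot_minibatch_kmeans.py": {"clustering", "kmeans", "online"},
    "plot_dbscan.py": {"clustering", "density"},
    "plot_ridge.py": {"regression"},
}
recommended = most_similar_by_tags(example_tags, n=2)
```

With these made-up tags, `plot_kmeans.py` is linked to the MiniBatchKMeans example first and the DBSCAN example second, while `plot_ridge.py` gets no recommendations because it shares no tag with the others.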

But when I see

> Instead we could have a recommender system based on similarity to automatically link to the most relevant related content

I, like @jklymak, get a bit worried that what you're talking about implementing in SG is parsing of Python code + output to automatically label or compute "distances" between examples, i.e., solve problem (1) automatically. I think this has to be out of scope for SG because there are potentially a lot of ways to do this, and we don't have the maintenance bandwidth for it and all potential modifications people might have in mind down the road.

If you do indeed want to do this sort of "automated tagging", then one approach that could work nicely for division of maintenance between packages is:

1. In SG we add a Sphinx event that occurs after all examples have been run, that gives four lists, all of length n_examples:
   1. list of input Python example files
   2. list of example labels from sphinx-tags (if used) extracted
   3. list of output RSTs generated
   4. list of lists of selected similar examples that will soon be linked to by adding to the RST

   Then whatever modifications are made to the list of RSTs generated (e.g., modifying the RST itself) and the list of lists of selected similar examples will be used to create the final output RST.
2. In sklearn you write a little sphinx extension that hooks into this event, and modifies that last list in whatever way you want by parsing Python, RST, and sphinx-tags to decide the examples that should be linked

At the end of the day, the end user would need SG and sklearn installed, and add not just 'sphinx_gallery' to their Sphinx extensions but also 'sklearn.sphinxext.automated_sg_tagging' (or whatever), and all options/config/whatever for the automated system could be handled at the sklearn end (or in some other module entirely).

I think this framework is general enough that it allows people to modify the end-of-page linked example lists in whatever way they want. It also makes things like modifying the generated RST easier than going through the source-read Sphinx event.

GaelVaroquaux · 2023-02-16T20:42:12Z

> This would mean an optional dependency on scikit-learn, from what I understand, for the
> > Maybe a nearest neighbors tf-idf of the symbols could do the job.

No, I was thinking that we could easily implement a basic version of these in pure Python + numpy.

> Because the tags are directives, they should be pretty easy to autopopulate once the clusters and classification are identified by the algorithm. Would this be a PR to sphinx-gallery or to sphinx-tags?

I don't know tags enough to answer. In general, I'm happy to go whichever way makes the ecosystem healthier
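For reference, a basic "nearest neighbors tf-idf" in pure Python + numpy could look something like the sketch below. This is only an illustration of the idea discussed in this thread, not the implementation that was eventually merged: the regex tokenizer is a crude stand-in for "symbols", and the smoothed idf formula and all function names are assumptions.

```python
import re
from collections import Counter

import numpy as np


def tokenize(source):
    """Split source text into identifier-like tokens (a crude stand-in for 'symbols')."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)


def tfidf_matrix(documents):
    """Return an (n_docs, n_terms) L2-normalized tf-idf matrix."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    index = {term: j for j, term in enumerate(vocabulary)}
    tf = np.zeros((len(documents), len(vocabulary)))
    for i, counter in enumerate(counts):
        for term, count in counter.items():
            tf[i, index[term]] = count
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    df = (tf > 0).sum(axis=0)
    idf = np.log((1 + len(documents)) / (1 + df)) + 1
    weighted = tf * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave empty documents as all-zero rows
    return weighted / norms


def nearest_examples(documents, n=5):
    """For each document, return indices of the n most cosine-similar others."""
    X = tfidf_matrix(documents)
    similarity = X @ X.T  # rows are unit length, so this is cosine similarity
    np.fill_diagonal(similarity, -np.inf)  # never recommend an example to itself
    order = np.argsort(-similarity, axis=1, kind="stable")
    return order[:, :n]
```

Calling `nearest_examples(sources, n=5)` on a list of example source strings returns, for each example, the indices of up to five other examples ranked by cosine similarity of their tf-idf vectors, which matches the "5 nearest neighbors" setting mentioned earlier in the thread.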

ArturoAmorQ mentioned this issue Apr 5, 2023

FEA Add examples recommender system #1125

Merged

lucyleeow closed this as completed in #1125 Nov 17, 2023
