[WIP] Self organising map #2996
Conversation
'''
The pseudo F statistic:

    pseudo F = [(T - PG) / (G - 1)] / [PG / (n - G)]
Please define T, G, PG and n. People should not have to read the reference to understand the documentation.
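For reference, a minimal sketch of how these quantities could be computed, assuming the standard definitions (none of which appear in the patch itself): T is the total sum of squared deviations from the grand mean, PG the pooled within-cluster sum of squares, G the number of clusters, and n the number of samples. Under those definitions this is essentially the Calinski-Harabasz index.

```python
import numpy as np

def pseudo_f(X, labels):
    """Pseudo-F statistic, using the standard definitions assumed here:
    T  - total sum of squared deviations from the grand mean,
    PG - pooled within-cluster sum of squares,
    G  - number of clusters,
    n  - number of samples."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    clusters = np.unique(labels)
    G = len(clusters)
    # T: total sum of squares around the grand mean of the whole data set
    T = ((X - X.mean(axis=0)) ** 2).sum()
    # PG: within-cluster sum of squares, pooled over all clusters
    PG = sum(((X[labels == g] - X[labels == g].mean(axis=0)) ** 2).sum()
             for g in clusters)
    return ((T - PG) / (G - 1)) / (PG / (n - G))
```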
This should probably be moved somewhere else, or removed entirely; I don't yet know what it is or how useful it is. It is currently only used in the examples. Suggestions for where to put it are welcome :)
I am fine with adding a new scoring metric as long as it's used in the literature and properly documented (both in the docstring and in the narrative doc).
Right, but where should it be added?
I moved the function into examples/cluster/som_digits.py and explained what's happening and why it's useful in commit 367c44b.
import pylab as pl
from matplotlib.colors import ListedColormap, NoNorm, rgb2hex
import numpy as np
from scikits.learn.cluster import SelfOrganizingMap
Oh gosh… This is old.
I will try to update the examples tomorrow.
The broken tests reported by Travis are real failures.
@ogrisel: it's a problem of convergence. The tests converge eventually, but it takes so many iterations that the tests become slow. It's possible that the generated data sets are pathological for this particular type of model. I plan to look into that more.
plot(init)
plt.title('Initial map')

som = SelfOrganizingMap(affinity=(16,16), n_iterations=1024,
This example doesn't appear to be working at the moment. Might need to look into it more. It's very similar to the example at http://www.pymvpa.org/examples/som.html
@NelleV Now would be a good time to review the code. I've been over it in depth with the debugger, and it looks like it's doing the right thing, but it isn't converging very well in some circumstances (in particular, the colormap example). I've tried a couple of modifications, for which I will add comments in a minute; some of these make the colormap example converge somewhat, but still really slowly, and they make one or both of the passing tests fail.

The main difference between this algorithm and Kohonen (1990) is that we're using an arbitrary graph for the SOM map (which is a grid by default), so we can't use the L2 metric and must use the L1 metric (post office/shortest path). This affects both the neighbourhood and radius functions. The neighbourhood, radius, and alpha (learning rate) functions are all somewhat arbitrary - Kohonen doesn't specify particular functions, just that they are monotonically decreasing as functions of time (iteration), or, in the case of the neighbourhood, of the distance between map nodes. It's not obvious to me that this shouldn't work in any particular case.
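To make the last point concrete, here is a minimal sketch of the kind of schedules being described. These particular functional forms are my own illustration, not necessarily what the branch implements; the only property relied on above is that they decrease monotonically with the iteration number (or, for the neighbourhood weight, with the shortest-path distance between map nodes).

```python
import numpy as np

# Illustrative schedules only - Kohonen (1990) just requires monotonic decay.
def learning_rate(iteration, n_iterations, alpha_0=0.5):
    """Learning rate alpha(t), decaying linearly towards zero."""
    return alpha_0 * (1.0 - iteration / float(n_iterations))

def neighbourhood_radius(iteration, n_iterations, radius_0=4.0):
    """Neighbourhood radius, shrinking exponentially over the run."""
    return radius_0 * np.exp(-iteration / float(n_iterations))

def neighbourhood_weight(graph_distance, radius):
    """Strength of the pull on a map node, as a function of its
    shortest-path (L1) distance from the winning node on the map graph."""
    return np.exp(-(np.asarray(graph_distance, dtype=float) ** 2)
                  / (2.0 * radius ** 2))
```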
winner = self.best_matching_center(x)
radius = self.radius_of_the_neighborhood(iteration)
updatable = self.cluster_centers_in_radius(winner, radius)
distances = np.sum((self.cluster_centers_[winner] - self.cluster_centers_[updatable])**2, axis=1)
This should be:
distances = self.distance_matrix[winner][updatable]
but if I make that change, the currently passing tests start failing. Not sure why. It DOES make the plot example start to converge, but only really slowly (increase n_iterations in the plot example to try it).
OK, the updater function was wrong (it was pulling using the distance from x to the winning node, instead of the distance to the node being pulled). It's working now, and converging well. I'll see if I can improve the radius and alpha functions, but it actually works quite well now. One of the cluster comparison data sets (the one with three clusters) is pathological for a 2x2 SOM. It works fine with a 3x1 grid, but then the others don't work so well.
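For readers following along, this is roughly what the corrected update amounts to, written as a standalone sketch. The argument names mirror the quoted code; the Gaussian weighting is the same illustrative choice as in the sketch above, not necessarily what the branch uses.

```python
import numpy as np

def som_update_step(centers, distance_matrix, x, winner, updatable,
                    alpha, radius):
    """Pull every centre in the winner's neighbourhood towards the sample x.

    The strength of each pull is a decreasing function of the node's graph
    distance from the *winning* node - not of the distance from x to the
    winner, which was the bug described above."""
    graph_distances = distance_matrix[winner][updatable]
    weights = np.exp(-(graph_distances ** 2) / (2.0 * radius ** 2))
    centers[updatable] += alpha * weights[:, np.newaxis] * (x - centers[updatable])
    return centers
```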
Can someone please have a look at this? It's been nearly two weeks since I got it working. The test failure appears to be something to do with estimator cloning - something that I don't touch. I have tried doing manually what I think the test is doing, and I can't replicate the problem (I don't know how
@@ -84,6 +84,8 @@
average_linkage = cluster.AgglomerativeClustering(linkage="average",
                                                  affinity="cityblock", n_clusters=2,
                                                  connectivity=connectivity)
som = cluster.SelfOrganizingMap(adjacency=(2, 2),
                                n_iterations= 1000)
pep8
Also, I am still not convinced about the SOM's usefulness. One of the items in the todo list has been ticked, but there is only one example, on the digits data, and the CH score there is lower (which I think means that the clustering quality is not as good as k-means' according to the CH assumption). This does not really highlight the SOM method, especially as the fitting time is significantly higher. However, when using other metrics the SOM solution is not that bad: for instance, the Adjusted Rand Score computed against the true digits labels says that the quality of the SOM solution is on average similar to the k-means solution. Please also include the SOM method in the cluster comparison example.

Similarly, the other TODO item has also been ticked, but from the narrative documentation and the examples I still don't get any intuition as to when one would rather use a SOM than another clustering method. What are the known applications where SOMs have been shown to work well in practice? Also, out of curiosity, what is your personal interest in using a SOM rather than simpler methods such as k-means?

Please untick those TODO items and address those comments first. As I said earlier, there is no point in implementing, reviewing and maintaining methods in scikit-learn if they cannot be clearly demonstrated, in the documentation and the examples, to be practically useful and better than other, simpler methods (at least for some problems). To clarify my point, I do not claim that SOMs are practically useless (I don't know); I just want to be convinced that they are not, as would the reader of the scikit-learn documentation.
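As a rough illustration of the kind of comparison being asked for, something along these lines could go in the digits example. This is only a sketch: SelfOrganizingMap and its adjacency/n_iterations parameters exist only on this branch, the 4x4 grid is an arbitrary choice made to match 16 k-means clusters, and the metric names assume the current sklearn.metrics API rather than what was available at the time.

```python
from sklearn import cluster, datasets, metrics

digits = datasets.load_digits()
X, y = digits.data, digits.target

estimators = [
    ('k-means', cluster.KMeans(n_clusters=16)),
    # SelfOrganizingMap comes from this branch; a 4x4 grid gives the same
    # number of clusters as the k-means run above.
    ('SOM', cluster.SelfOrganizingMap(adjacency=(4, 4), n_iterations=1000)),
]

for name, est in estimators:
    labels = est.fit(X).labels_
    print(name,
          metrics.calinski_harabasz_score(X, labels),  # CH / pseudo-F
          metrics.adjusted_rand_score(y, labels))      # vs. true digit labels
```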
This is a difficulty: the SOM theoretically should always perform worse than k-means on such scores. As discussed in #2892, it isn't the simple fitting that a SOM is useful for; rather, it is the fact that the SOM grid adds a semantic layer to the cluster layout. SOMs are basically useful for (a) dimensionality reduction where there are strong non-linear patterns in the data, and (b) using the semantic grid as a basis for creating small-multiple plots from the data in each cluster.

I will see if I can come up with some examples that show these uses.
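As an illustration of use case (b), a small-multiples layout could look something like the sketch below. It is purely illustrative: it assumes a fitted SOM whose labels_ are node indices laid out row-major on a known grid shape, and 8x8 digit-style images so each panel can show the cluster mean as a picture.

```python
import matplotlib.pyplot as plt
import numpy as np

def som_small_multiples(X, labels, grid_shape, image_shape=(8, 8)):
    """One panel per SOM node, laid out according to the node's position on
    the map grid, so neighbouring panels correspond to similar clusters."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    rows, cols = grid_shape
    fig, axes = plt.subplots(rows, cols, squeeze=False,
                             figsize=(2 * cols, 2 * rows))
    for node, ax in enumerate(axes.ravel()):
        members = X[labels == node]
        if len(members):
            # Each panel shows the cluster mean as an image here, but any
            # per-cluster summary plot would serve the same purpose.
            ax.imshow(members.mean(axis=0).reshape(image_shape), cmap='gray')
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_title('node %d (n=%d)' % (node, len(members)), fontsize=8)
    return fig
```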
Thanks!
I think this should go into contrib (looks like @naught101 gave up and then decided to not give up?).
Closing for now; feel free to reopen and argue. I think it would be nice to have in contrib, but not for master.
I've been busy with other stuff, and haven't needed this module. Sorry that I haven't had time to fill out the examples. I was kind of thinking a code review beforehand would be useful, as well as some indication that this would be useful in sklearn. I'm not too hung up on it being in the core package, but I'm worried that I wouldn't be able to adequately maintain it in contrib (due to a lack of expertise in the theory, as well as in code efficiency). But I guess that's as good an argument for keeping it out of core, too :)
I don't think having an unmaintained extension is much of an issue.
Update of Sebastien Campion's pull request, as discussed in #2892
TODO: