[WIP] Self organising map #2996

Closed
wants to merge 39 commits into from

naught101

Update of Sebastien Campion's pull request, as discussed in #2892

TODO:

  • add some narrative documentation as a new section in the clustering chapter, in particular try to highlight pros and cons and typical domains where SOM is applied with success in the real world.
  • investigate SOM's usefulness on some of scikit-learn's built-in (non-synthetic) datasets and showcase it as an example (new or updated existing example)

'''
The pseudo F statistic:

pseudo F = [(T - PG) / (G - 1)] / [PG / (n - G)]
Member

Please define T, G, PG and n. People should not have to read the reference to understand the documentation.

Author

This should probably be moved somewhere else, or removed entirely. I don't yet know what it is, or how useful it is. It is currently only used in the examples. Suggestions for where to put it are welcome :)

Member

I am fine with adding a new scoring metric as long as it's used in the literature and properly documented (both in the docstring and in the narrative doc).

Author

Right, but where should it be added?

Author

I moved the function into the examples/cluster/som_digits.py, and explained what's happening, and why it's useful in commit 367c44b
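
For readers following the review request above, here is a minimal sketch of the pseudo F statistic with its symbols spelled out. It assumes the docstring follows the usual Calinski-Harabasz definition (the thread later calls it the CH score); the function name `pseudo_f` and the plain-list inputs are illustrative and are not taken from the pull request.

```python
def pseudo_f(points, labels):
    """Pseudo F (Calinski-Harabasz) statistic for a hard clustering.

    T  : total sum of squares (every point to the global mean)
    PG : pooled within-group sum of squares (every point to its group mean)
    G  : number of groups (clusters)
    n  : number of observations
    """
    n = len(points)
    dim = len(points[0])

    # Collect the points belonging to each cluster label.
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    G = len(groups)

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sum_sq(rows, centre):
        return sum((r[d] - centre[d]) ** 2 for r in rows for d in range(dim))

    T = sum_sq(points, mean(points))
    PG = sum(sum_sq(rows, mean(rows)) for rows in groups.values())

    # Between-group variance over within-group variance.
    return ((T - PG) / (G - 1)) / (PG / (n - G))
```

A larger value means the clusters are compact relative to how far apart they are. The sketch needs at least two groups and a non-zero PG.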

@ogrisel ogrisel changed the title from "Self organising map" to "[WIP] Self organising map" Mar 24, 2014
import pylab as pl
from matplotlib.colors import ListedColormap, NoNorm, rgb2hex
import numpy as np
from scikits.learn.cluster import SelfOrganizingMap
Member

Oh gosh… This is old.

Author

I will try to update the examples tomorrow.

@ogrisel
Member

ogrisel commented Mar 24, 2014

The broken tests reported by travis are real failures.

@naught101
Author

@ogrisel: it's a problem of convergence. The tests converge eventually, but they need so many iterations that they become slow. It's possible that the generated data sets are pathological for this particular type of model. I plan to look into that more.

plot(init)
plt.title('Initial map')

som = SelfOrganizingMap(affinity=(16,16), n_iterations=1024,
Author

This example doesn't appear to be working at the moment. Might need to look into it more. It's very similar to the example at http://www.pymvpa.org/examples/som.html

@naught101
Author

@NelleV Now would be a good time to review the code. I've been over it in depth with the debugger, and it looks like it's doing the right thing, but it isn't converging very well in some circumstances (in particular, the examples/cluster/plot_som_colormap.py doesn't really converge at all).

I've tried a couple of modifications, for which I will add comments in a minute. Some of these make the colormap example converge somewhat, though still really slowly, but they also make one or both of the passing tests fail.

The main difference between this algorithm and Kohonen (1990) is that we're using an arbitrary graph for the SOM map (a grid by default), so we can't use the L2 metric and must use the L1 metric (post office/shortest path) instead. This affects both the neighbourhood and radius functions.

The neighborhood, radius, and alpha (learning rate) functions are all somewhat arbitrary - Kohonen doesn't specify particular functions, just that they are monotonically decreasing as functions of time (iteration), or, in the case of neighbourhood, of distance between map nodes. It's not obvious to me that this shouldn't work in any particular case.
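
To make the description above concrete, here is a rough sketch of a shortest-path distance on a rectangular 4-connected grid, together with decreasing radius, learning-rate and neighbourhood functions. The specific schedules (exponential, linear, Gaussian) and all function names are assumptions chosen for illustration; the thread only says that these functions are monotonically decreasing, and this is not the code from the pull request.

```python
import math


def grid_distance(a, b):
    """Shortest-path length between two nodes of a 4-connected grid.

    On a full rectangular grid this equals the L1 (city block) distance
    between the (row, column) coordinates.
    """
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def radius(iteration, n_iterations, initial_radius):
    """Hypothetical schedule: shrink exponentially from initial_radius to 1."""
    rate = math.log(initial_radius) / n_iterations
    return initial_radius * math.exp(-iteration * rate)


def alpha(iteration, n_iterations, initial_alpha=0.5):
    """Hypothetical learning rate: linear decay from initial_alpha to 0."""
    return initial_alpha * (1.0 - iteration / n_iterations)


def neighbourhood(distance, current_radius):
    """Gaussian weight that decreases with grid distance from the winner."""
    return math.exp(-(distance ** 2) / (2.0 * current_radius ** 2))
```

For an arbitrary graph rather than a grid, `grid_distance` would have to be replaced by a precomputed shortest-path matrix, which appears to be the role of `distance_matrix` in the snippet discussed below.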

winner = self.best_matching_center(x)
radius = self.radius_of_the_neighborhood(iteration)
updatable = self.cluster_centers_in_radius(winner, radius)
distances = np.sum((self.cluster_centers_[winner] - self.cluster_centers_[updatable])**2, axis=1)
Author

this should be

        distances = self.distance_matrix[winner][updatable]

but if I change it to that, the currently passing tests start failing. I'm not sure why. It DOES make the plot example start to converge, but only very slowly (increase n_iterations in the plot example to try it).

@naught101
Author

OK, the updater function was wrong (it used the distance from x to the winning node, instead of the distance from x to the node being pulled). It's working now, and converging well. I'll see if I can improve the radius and alpha functions, but it actually works quite well now.
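
The fix described above can be sketched as a single update step in which every node inside the radius is pulled towards x by its own offset. This is an illustrative reconstruction under stated assumptions (grid-shaped map, Gaussian neighbourhood, plain Python lists); `update_step` and its parameters are hypothetical names, not the pull request's code.

```python
import math


def update_step(centres, grid_positions, x, learning_rate, current_radius):
    """Apply one SOM update for sample x; return the winning node index."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    # Best matching unit: the centre closest to x in data space.
    winner = min(range(len(centres)), key=lambda j: sq_dist(centres[j], x))

    for j, centre in enumerate(centres):
        # The neighbourhood is measured on the map (grid steps from the
        # winner), not in data space.
        steps = (abs(grid_positions[j][0] - grid_positions[winner][0])
                 + abs(grid_positions[j][1] - grid_positions[winner][1]))
        if steps > current_radius:
            continue
        weight = math.exp(-(steps ** 2) / (2.0 * current_radius ** 2))
        # Pull node j towards x using its own offset (x - centre_j),
        # not the winner's offset (x - centre_winner).
        centres[j] = [c + learning_rate * weight * (xi - c)
                      for c, xi in zip(centre, x)]
    return winner
```

With the winner's offset used for every node, a sample that coincides with the winner would leave all of its neighbours untouched, which is one way such a bug can stall convergence.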

One of the cluster comparison data sets (the one with three clusters) is pathological for a 2x2 SOM. It works fine with a 3x1 grid, but then the others don't work so well.

@naught101
Author

Can someone please have a look at this? It's been nearly 2 weeks since I got it working. The test failure appears to have something to do with estimator cloning, which I don't touch. I have tried doing manually what I think the test is doing, and I can't replicate the problem (I don't know how AttributeError: 'NoneType' object has no attribute 'shape' is happening). If someone could give me a heads-up on why this might be happening, that would be very useful.

@@ -84,6 +84,8 @@
average_linkage = cluster.AgglomerativeClustering(linkage="average",
affinity="cityblock", n_clusters=2,
connectivity=connectivity)
som = cluster.SelfOrganizingMap(adjacency=(2, 2),
n_iterations= 1000)
Member

pep8

@ogrisel
Member

ogrisel commented Apr 7, 2014

Also, I am still not convinced about the SOM usefulness. In the todo list the item:

  • investigate SOM's usefulness on some of scikit-learn's built-in (non-synthetic) datasets and showcase it as an example (new or updated existing example)

has been ticked. However, there is only one example, on the digits data, and the CH score is lower (which I think means that the clustering quality is not as good as k-means' according to the CH assumption). This does not really highlight the SOM method, especially as the fitting time is significantly higher.

However when using other metrics the SOM solution is not that bad. For instance Adjusted Rand Score computed with the true digits labels says that the quality of the SOM solution is on average similar to the k-means solution.
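
As a reminder of what that metric measures, here is a small self-contained sketch of the adjusted Rand index. scikit-learn ships this as `sklearn.metrics.adjusted_rand_score`; the pure-Python function below is only an illustration (its name is made up here), and the numbers in the usage note come from toy labels, not from the digits experiment.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two labelings of the same samples."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of sample pairs that share a cluster in both / each labeling.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)
    maximum = 0.5 * (sum_true + sum_pred)
    if maximum == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (sum_pairs - expected) / (maximum - expected)
```

The score ignores how clusters are numbered, so `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` is 1.0. That is what makes it usable for comparing SOM node assignments and k-means labels against the true digit classes.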

Please also include the SOM method in the cluster comparison example.

Similarly the other TODO item has also been ticked:

  • add some narrative documentation as a new section in the clustering chapter, in particular try to highlight pros and cons and typical domains where SOM is applied with success in the real world.

After reading the narrative documentation and the examples, I still don't get any intuition as to where one would rather use SOM than another clustering method. What are the known applications where SOM has been shown to work well in practice?

Also out of curiosity, what is your personal interest in using SOM rather than simpler methods such as k-means?

Please untick those TODO items and address those comments first. As I said earlier, there is no point in implementing, reviewing and maintaining methods in scikit-learn if they cannot be clearly demonstrated to be practically useful and better than other simpler methods (at least for some problems) in the documentation and the examples.

To clarify my point, I am not stating that SOMs are practically useless (I don't know). I just want to be convinced that they are not, as would the reader of the scikit-learn documentation.

@naught101
Author

Please also include the SOM method in the cluster comparison example.
I have, see naught101@0f1b0de - this is why I ticked the second box.

However, there is only one example, on the digits data, and the CH score is lower (which I think means that the clustering quality is not as good as k-means' according to the CH assumption). This does not really highlight the SOM method, especially as the fitting time is significantly higher.

However when using other metrics the SOM solution is not that bad. For instance Adjusted Rand Score computed with the true digits labels says that the quality of the SOM solution is on average similar to the k-means solution.

This is a difficulty: in theory, the SOM should always perform worse than K-means. As discussed in #2892, it isn't the simple fitting that SOM is useful for; rather, it is the fact that the SOM grid adds a semantic layer to the cluster layout. SOMs are basically useful for a) dimensionality reduction, where there are strong non-linear patterns in the data, and b) using the semantic grid as a basis for creating small-multiple plots based on data from each cluster.
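
One way to picture the "semantic layer" is to map every sample to the grid coordinate of its best matching unit, which gives a 2-D embedding and a natural layout for small-multiple plots. The sketch below assumes an already trained map given as plain lists; `project_to_grid` and its arguments are hypothetical and not part of the proposed estimator.

```python
def project_to_grid(samples, centres, grid_positions):
    """Return the grid coordinate of the best matching unit of each sample."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    coords = []
    for x in samples:
        # Nearest centre in data space; its grid position is the embedding.
        bmu = min(range(len(centres)), key=lambda j: sq_dist(centres[j], x))
        coords.append(grid_positions[bmu])
    return coords
```

Grouping the samples by the returned coordinate yields one panel per grid cell, and neighbouring panels hold similar data because neighbouring nodes were trained together.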

After reading the narrative documentation and the examples, I still don't get any intuition as to where one would rather use SOM than another clustering method. What are the known applications where SOM has been shown to work well in practice?

I will see if I can come up with some examples that show these uses.

@ogrisel
Member

ogrisel commented Apr 7, 2014

Thanks!

@naught101 naught101 closed this Apr 7, 2016
@naught101 naught101 deleted the self_organising_map branch April 7, 2016 04:08
@naught101 naught101 restored the self_organising_map branch April 7, 2016 04:08
@naught101 naught101 reopened this Apr 7, 2016
@naught101 naught101 force-pushed the self_organising_map branch from 5c4bff6 to 0cd1b8a Compare April 7, 2016 04:12
@amueller
Member

I think this should go into contrib (looks like @naught101 gave up and then decided to not give up?).

@amueller
Member

Closing for now, feel free to reopen and argue. I think it would be nice to have in contrib, but not for master.

@amueller amueller closed this Oct 25, 2016
@naught101
Author

I've been busy with other stuff, and haven't needed this module. Sorry that I haven't had time to fill out the examples. I was kind of hoping that a code review beforehand would be useful, as well as some indication that this would be useful in sklearn. I'm not too hung up on it being in the core package, but I'm worried that I wouldn't be able to adequately maintain it in contrib (due to lack of expertise in the theory, as well as code efficiency). But I guess that's as good an argument for keeping it out of core, too :)

@jnothman
Member

I don't think having an unmaintained extension is much of an issue.
